Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state-of-the-art models. It was introduced in October 2018 by researchers at Google.[1][2] A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."[3]
BERT was originally implemented in the English language at two model sizes:[1] (1) BERTBASE: 12 encoder blocks, each with 12 bidirectional self-attention heads, totaling 110 million parameters, and (2) BERTLARGE: 24 encoder blocks, each with 16 bidirectional self-attention heads, totaling 340 million parameters. Both models were pre-trained on the Toronto BookCorpus[4] (800M words) and English Wikipedia (2,500M words).
Design
BERT is an "encoder-only" transformer architecture.
At a high level, BERT consists of three modules:
Embedding: This module converts an array of one-hot encoded tokens into an array of real-valued vectors representing the tokens. It represents the conversion of discrete token types into a lower-dimensional Euclidean space.
Encoder stack: A sequence of Transformer encoder blocks. They perform transformations over the array of representation vectors, one of which is bidirectional self-attention.
Un-embedding/decoder: This module converts the final representation vectors into one-hot encoded tokens again by producing a predicted probability distribution over the token types. It can be viewed as a simple decoder, decoding the latent representation into token types.
The decoder module is necessary for pre-training, but it is often unnecessary for so-called "downstream tasks," such as question answering or sentiment classification. Instead, one removes the decoder module and replaces it with a newly initialized module suited for the task. The latent vector representation of the model is directly fed into this new module, allowing for sample-efficient transfer learning.
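The replacement of the decoder by a task-specific module can be sketched as follows. This is a minimal illustration, not the released implementation: the class name `ClassificationHead`, the choice of reading only the first ([CLS]) vector, the tiny hidden size, and the random stand-in for the encoder output are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ClassificationHead:
    """Hypothetical newly initialized linear layer that replaces the decoder."""

    def __init__(self, hidden_size, num_labels, seed=0):
        rng = random.Random(seed)
        # Small random weights: this module starts untrained.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                       for _ in range(num_labels)]
        self.bias = [0.0] * num_labels

    def __call__(self, hidden_states):
        # Assumption: the first output vector summarizes the whole input.
        cls_vector = hidden_states[0]
        logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
                  for row, b in zip(self.weight, self.bias)]
        return softmax(logits)


# Stand-in for the encoder output: 5 tokens, 8 dimensions each.
rng = random.Random(1)
hidden_states = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]

head = ClassificationHead(hidden_size=8, num_labels=2)
probs = head(hidden_states)  # one probability per label
```

During fine-tuning, both the new head and the pre-trained encoder weights would typically be updated; the sketch only shows the forward pass.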
BERT uses WordPiece tokenization, a sub-word strategy like byte pair encoding, to split text into tokens and map each token to a unique integer code. BERT uses a vocabulary size of 30,000, and any token not appearing in its vocabulary is replaced by [UNK] (for "unknown").
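Sub-word tokenization of this kind can be sketched with a greedy longest-match-first lookup. The vocabulary below is a toy one invented for the example (the real vocabulary has about 30,000 entries), and the function names are hypothetical; the "##" prefix marks a piece that continues a word.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Map whitespace-separated text to integer codes."""
    ids = []
    for word in text.lower().split():
        ids.extend(vocab[piece] for piece in wordpiece_tokenize(word, vocab))
    return ids


# Toy vocabulary, for illustration only.
toy_vocab = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3,
             "un": 4, "##play": 5, "##able": 6}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(encode("playing xyz", toy_vocab))             # [1, 2, 0]
```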
Pre-training
BERT was pre-trained simultaneously on two tasks:[5]
Masked Language Modeling (MLM): 15% of tokens were selected for prediction, and the training objective was to predict each selected token given its context. Each selected token was
replaced with a [MASK] token with probability 80%,
replaced with a random word token with probability 10%,
not replaced with probability 10%.
For example, the sentence "my dog is cute" may have its fourth token selected for prediction. The model would then receive the input text
"my dog is [MASK]" with probability 80%,
"my dog is happy" with probability 10%,
"my dog is cute" with probability 10%.
After processing the input text, the model's fourth output vector is passed to its decoder layer, which outputs a probability distribution over its 30,000-token vocabulary.
Next Sentence Prediction (NSP): Given two spans of text, the model predicts whether these two spans appeared sequentially in the training corpus, outputting either [IsNext] or [NotNext]. The first span starts with a special token [CLS] (for "classify"). The two spans are separated by a special token [SEP] (for "separate"). After processing the two spans, the first output vector (the vector coding for [CLS]) is passed to a separate neural network for the binary classification into [IsNext] and [NotNext].
For example, given "[CLS] my dog is cute [SEP] he likes playing" the model should output token [IsNext].
Given "[CLS] my dog is cute [SEP] how do magnets work" the model should output token [NotNext].
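The masked language modeling corruption rule described above (select 15% of tokens, then apply the 80% / 10% / 10% replacement) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `mask_tokens` and the tiny replacement vocabulary are hypothetical, each token is selected independently rather than by drawing a fixed number of positions, and the special tokens [CLS] and [SEP] are assumed never to be selected.

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, {position: original token to predict})."""
    corrupted = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS:
            continue  # assumption: special tokens are never selected
        if rng.random() >= select_prob:
            continue  # token not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets


tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]"]
toy_vocab = ["happy", "cat", "runs"]  # toy replacement vocabulary
corrupted, targets = mask_tokens(tokens, toy_vocab, random.Random(0))
print(corrupted, targets)
```

The training loss would be computed only at the positions listed in `targets`, whether or not the token at that position was actually changed.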
As a result of this training process, BERT learns latent representations of tokens and text in context. After pre-training, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks such as natural language inference and text classification, and sequence-to-sequence-based language generation tasks such as question answering and conversational response generation.[1] The pre-training stage is significantly more computationally expensive than fine-tuning.
Architecture details
This section describes BERTBASE. BERTLARGE has the same structure at a larger scale.
The first layer is the embedding layer, which contains three components: token type embeddings, position embeddings, and segment type embeddings.
Token Types: The token type embedding is a standard embedding layer, translating a one-hot vector into a dense vector based on its token type.
Positions: The position embeddings are based on a token's position in the sequence. BERT uses absolute position embeddings, where each position in the sequence is mapped to a real-valued vector. These vectors are learned during pre-training rather than computed from the fixed sinusoidal functions used in the original Transformer.
Segment Types: Using a vocabulary of just 0 or 1, this embedding layer produces a dense vector based on whether the token belongs to the first or second text segment in that input. In other words, type-1 tokens are all tokens that appear after the [SEP] special token. All prior tokens are type-0.
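The segment type assignment can be illustrated by building a two-span input in the format used by the examples in this article. The function name `build_pair_input` is hypothetical, and the layout follows the description above (everything up to and including [SEP] is type 0, everything after it is type 1).

```python
def build_pair_input(first_span, second_span):
    """Join two token spans and label each token with its segment type."""
    tokens = ["[CLS]"] + first_span + ["[SEP]"] + second_span
    # [CLS], the first span, and [SEP] are type 0; the rest is type 1.
    segment_ids = [0] * (len(first_span) + 2) + [1] * len(second_span)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(
    ["my", "dog", "is", "cute"],
    ["he", "likes", "playing"],
)
for token, segment in zip(tokens, segment_ids):
    print(segment, token)
```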
The three embedding vectors are added together, so that the initial representation of each token is a function of all three pieces of information. After embedding, the vector representation is normalized using a LayerNorm operation, outputting a 768-dimensional vector for each input token.
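The sum-then-normalize step can be sketched in a few lines. This is an illustration under stated assumptions: the three embeddings are treated as plain lookup tables filled with random values, the hidden size is 8 instead of 768, and the learned scale and shift parameters of LayerNorm are omitted. All names are hypothetical.

```python
import math
import random


def layer_norm(vector, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def make_table(rows, dim, rng):
    """Random stand-in for a trained embedding table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, segment_ids, token_table, position_table, segment_table):
    """Add token, position, and segment embeddings, then apply LayerNorm."""
    output = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        summed = [t + p + s for t, p, s in zip(token_table[tok],
                                               position_table[position],
                                               segment_table[seg])]
        output.append(layer_norm(summed))
    return output


rng = random.Random(0)
hidden = 8  # 768 in BERTBASE; kept small here
token_table = make_table(rows=10, dim=hidden, rng=rng)    # toy vocabulary of 10
position_table = make_table(rows=6, dim=hidden, rng=rng)  # up to 6 positions
segment_table = make_table(rows=2, dim=hidden, rng=rng)   # segment types 0 and 1

vectors = embed([1, 4, 7, 2], [0, 0, 1, 1],
                token_table, position_table, segment_table)
```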
After this, the representation vectors are passed forward through 12 Transformer encoder blocks, and are decoded back to 30,000-dimensional vocabulary space using a basic affine transformation layer.
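The figure of roughly 110 million parameters for BERTBASE can be checked with a short calculation. Several values below are assumptions not stated in this article: a vocabulary of 30,522 entries, 512 positions, a feed-forward width of 3,072, and a final "pooler" layer are taken from the publicly released BERTBASE configuration, and the pre-training output layers are left out of the count. The function name is hypothetical.

```python
def count_encoder_parameters(vocab_size, hidden, layers, ffn_size,
                             max_positions, segment_types=2):
    """Count weights and biases of the embeddings, encoder blocks, and pooler."""
    layer_norm = 2 * hidden  # scale and shift vectors

    embeddings = (vocab_size * hidden        # token embeddings
                  + max_positions * hidden   # position embeddings
                  + segment_types * hidden   # segment embeddings
                  + layer_norm)

    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output
    feed_forward = (hidden * ffn_size + ffn_size) + (ffn_size * hidden + hidden)
    block = attention + layer_norm + feed_forward + layer_norm

    pooler = hidden * hidden + hidden  # assumed dense layer on the first vector

    return embeddings + layers * block + pooler


# Assumed BERTBASE configuration (see the lead-in for which values are assumptions).
total = count_encoder_parameters(vocab_size=30522, hidden=768, layers=12,
                                 ffn_size=3072, max_positions=512)
print(f"{total:,}")  # 109,482,240
```

Under these assumptions the count comes to about 109.5 million, consistent with the rounded figure of 110 million.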
The reasons for BERT's state-of-the-art performance on natural language understanding tasks are not yet well understood.[8][9] Current research has focused on investigating the relationship between carefully chosen input sequences and BERT's output,[10][11] analysis of internal vector representations through probing classifiers,[12][13] and the relationships represented by attention weights.[8][9]
The high performance of the BERT model could also be attributed to the fact that it is bidirectionally trained. This means that BERT, based on the Transformer model architecture, applies its self-attention mechanism to learn information from both the left and the right side of each token during training, and consequently gains a deep understanding of the context. For example, the word "fine" can have two different meanings depending on the context ("I feel fine today", "She has fine blond hair"). BERT considers the words surrounding the target word "fine" on both the left and the right side.
However, this comes at a cost: because the encoder-only architecture lacks a decoder, BERT cannot be prompted and cannot generate text. Bidirectional models in general do not work effectively without the right-side context, which makes them difficult to prompt; even short text generation requires sophisticated, computationally expensive techniques.[14]
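The difference between bidirectional self-attention and the left-to-right attention used by generative models can be illustrated with a toy attention-weight computation. This is a sketch only: the function name is hypothetical, all raw scores are set to zero so that the weights depend purely on which positions are visible, and the query, key, and value projections of a real attention head are left out.

```python
import math


def attention_weights(scores, causal=False):
    """Row-wise softmax over scores; with causal=True, later positions are hidden."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            visible = (j <= i) or not causal
            row.append(scores[i][j] if visible else float("-inf"))
        top = max(row)
        exps = [math.exp(s - top) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


tokens = ["I", "feel", "fine", "today"]
scores = [[0.0] * len(tokens) for _ in tokens]  # uniform toy scores

bidirectional = attention_weights(scores)
left_to_right = attention_weights(scores, causal=True)

target = tokens.index("fine")
print(bidirectional[target])  # [0.25, 0.25, 0.25, 0.25]
print(left_to_right[target])  # "today" receives weight 0.0
```

In the bidirectional case the representation of "fine" draws on "today" as well as on the preceding words, whereas the left-to-right variant cannot see anything to its right.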
Unlike deep neural networks trained from scratch, which require very large amounts of task-specific data, BERT has already been pre-trained: it has learned representations of words and sentences as well as the underlying semantic relations between them. BERT can then be fine-tuned on smaller datasets for specific tasks such as sentiment classification. The pre-trained model is chosen according to both the content of the dataset and the goal of the task. For example, if the task is sentiment classification on financial data, a model pre-trained for sentiment analysis of financial text should be chosen. The weights of the original pre-trained models were released on GitHub.[15]
History
BERT was originally published by Google researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The design has its origins in pre-training contextual representations, including semi-supervised sequence learning,[16] generative pre-training, ELMo,[17] and ULMFit.[18] Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, whereas BERT takes into account the context for each occurrence of a given word. For instance, whereas the word "running" has the same word2vec vector in both of the sentences "He is running a company" and "He is running a marathon", BERT provides a contextualized embedding that differs according to the sentence.
On October 25, 2019, Google announced that they had started applying BERT models for English language search queries within the US.[19] On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages.[20] In October 2020, almost every single English-based query was processed by a BERT model.[21]
A later paper proposes RoBERTa, which preserves BERT's architecture, but improves its training, changing key hyperparameters, removing the next-sentence prediction task, and using much larger mini-batch sizes.[22]
References
Kovaleva, Olga; Romanov, Alexey; Rogers, Anna; Rumshisky, Anna (November 2019). "Revealing the Dark Secrets of BERT". Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 4364–4373. doi:10.18653/v1/D19-1445. S2CID 201645145.
Khandelwal, Urvashi; He, He; Qi, Peng; Jurafsky, Dan (2018). "Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 284–294. arXiv:1805.04623. doi:10.18653/v1/p18-1027. S2CID 21700944.
Gulordava, Kristina; Bojanowski, Piotr; Grave, Edouard; Linzen, Tal; Baroni, Marco (2018). "Colorless Green Recurrent Networks Dream Hierarchically". Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 1195–1205. arXiv:1803.11138. doi:10.18653/v1/n18-1108. S2CID 4460159.
Giulianelli, Mario; Harding, Jack; Mohnert, Florian; Hupkes, Dieuwke; Zuidema, Willem (2018). "Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information". Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 240–248. arXiv:1808.08079. doi:10.18653/v1/w18-5426. S2CID 52090220.
Patel, Ajay; Li, Bryan; Rasooli, Mohammad Sadegh; Constant, Noah; Raffel, Colin; Callison-Burch, Chris (2022). "Bidirectional Language Models Are Also Few-shot Learners". arXiv:2209.14500 [cs.LG].