Keeping up with the BERTs
The most popular family in NLP town
If you even slightly follow the NLP world, or even ML news in general, you have most likely come across Google's BERT model or one of its relatives. If you haven't, and have still somehow stumbled across this article, let me have the honor of introducing you to BERT, the powerful NLP beast.
What is BERT?
BERT stands for Bidirectional Encoder Representations from Transformers and is a language representation model by Google. It uses two steps, pre-training and fine-tuning, to create state-of-the-art models for a wide range of tasks.
Its distinctive feature is the unified architecture across different downstream tasks (what these are, we will discuss soon). That means the same pre-trained model can be fine-tuned for a variety of final tasks that may look nothing like the tasks it was pre-trained on, and still give close to state-of-the-art results.
In short, we first train the model on the pre-training tasks simultaneously. Once pre-training is complete, the same model can be fine-tuned for a variety of downstream tasks. Note that a separate model is fine-tuned for each specific downstream task, so a single pre-trained model can yield multiple task-specific models after fine-tuning.
BERT Architecture
Simply put, BERT is a stack of Transformer encoder blocks. You can read about Transformers in detail in my previous article. Or, if you already have a faint idea of how they work, check out this absolutely bomb 3D diagram of the encoder block used in BERT. Seriously, you can't miss this!
Now let’s look at some numbers that none of us will ever remember, but our understanding will feel incomplete without them, so here goes nothing:
L = Number of layers (i.e., #Transformer encoder blocks in the stack).
H = Hidden size (i.e., the dimensionality of each token's representation throughout the encoder; the per-head q, k and v vectors have size H/A).
A = Number of attention heads.
- BERT Base: L=12, H=768, A=12. Total Parameters = 110M!
- BERT Large: L=24, H=1024, A=16. Total Parameters = 340M!!
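If you want to check these numbers yourself rather than take them on faith, here is a minimal sketch, assuming you have the Hugging Face `transformers` library (and PyTorch) installed; it loads each pre-trained checkpoint and simply counts the parameters:

```python
# Sanity-check the sizes of BERT Base and BERT Large.
# Assumes `pip install transformers torch` and an internet connection to download weights.
from transformers import BertModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    cfg = model.config
    print(f"{name}: L={cfg.num_hidden_layers}, H={cfg.hidden_size}, "
          f"A={cfg.num_attention_heads}, params={n_params / 1e6:.0f}M")
# Expected output: roughly 110M for base and 340M for large.
```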
What makes it Bidirectional?
We usually create a language model by training it on tasks that are not the end goal themselves, but that help the model develop a contextual understanding of words. More often than not, such tasks involve predicting the next word, or words in close vicinity of each other. These training methods can't be extended to bidirectional models as they are, because they would allow each word to indirectly "see itself": when you approach the same sentence again from the opposite direction, you already know what to expect. A case of data leakage.
In such a situation, the model could trivially predict the target word. Moreover, we couldn't guarantee that a fully trained model had actually learnt the contextual meaning of words to some extent, rather than just optimizing for those trivial predictions.
So how does BERT manage to pre-train bidirectionally? It does so by using a procedure called Masked LM. More details on it later, so read on, my friend.
Pre-training BERT
The BERT model is trained on the following two unsupervised tasks.
1. Masked Language Model (MLM)
This task enables the deep bidirectional learning aspect of the model. Some percentage of the input tokens are masked (replaced with the [MASK] token) at random, and the model tries to predict these masked tokens, not the entire input sequence. The model's outputs for the masked positions are then fed into a softmax over the vocabulary to produce the predicted words.
This, however, creates a mismatch between pre-training and fine-tuning, because most downstream tasks do not involve predicting masked words. This is mitigated by a subtle twist in how we mask the input tokens.
Approximately 15% of the tokens are selected for masking during training, but not all of the selected tokens are actually replaced by the [MASK] token. Instead, a selected token is handled as follows (see the sketch after this list):
- 80% of the time, it is replaced with the [MASK] token.
- 10% of the time, it is replaced with a random token.
- 10% of the time, it is left unchanged.
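To make the 80/10/10 rule concrete, here is a minimal, hypothetical Python sketch of the masking step. Real implementations work on token IDs, handle special tokens and whole-word pieces, but the logic is the same:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: select ~15% of positions, then replace the token with
    [MASK] 80% of the time, a random token 10%, or leave it unchanged 10%."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # prediction targets for selected positions only
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:    # select this position for prediction
            labels[i] = tok                # the model must recover the original token here
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return masked, labels

# Example usage with a toy vocabulary:
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))
```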
2. Next Sentence Prediction (NSP)
A language model doesn't directly capture the relationship between two sentences, which is relevant in many downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI). The model is taught sentence relationships by training on a binarized NSP task.
In this task, two sentences, A and B, are chosen for each pre-training example (a sketch of the sampling follows the list):
- 50% of the time B is the actual next sentence that follows A.
- 50% of the time B is a random sentence from the corpus.
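Here is a simplified, hypothetical sketch of how such pairs could be sampled from a corpus; it assumes every document has at least two sentences and ignores the length bookkeeping a real pipeline would do:

```python
import random

def make_nsp_example(docs):
    """Build one NSP training pair from a corpus of documents,
    where each document is a list of sentences."""
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)                 # pick a sentence that has a successor
    sentence_a = doc[i]
    if random.random() < 0.5:
        sentence_b, label = doc[i + 1], "IsNext"       # the actual next sentence
    else:
        other = random.choice(docs)                    # a random sentence from the corpus
        sentence_b, label = random.choice(other), "NotNext"
    return sentence_a, sentence_b, label
```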
Training: Inputs and Outputs
The model is trained on both of the above-mentioned tasks simultaneously. This is made possible by clever usage of inputs and outputs.
Inputs
The model needs to be able to represent both a single sentence and a pair of sentences unambiguously in one token sequence. The authors note that a "sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A [SEP] token separates the two sentences, and a learnt segment embedding marks each token as belonging to sentence A or sentence B.
Problem #1: All the input tokens are fed in a single step (as opposed to RNNs, where inputs are fed sequentially), so the model has no inherent way of preserving the order of the tokens. Word order matters in every language, both semantically and syntactically.
Problem #2: In order to perform the Next Sentence Prediction task properly, we need to be able to distinguish between sentences A and B. Fixing the lengths of the sentences would be too restrictive and a potential bottleneck for various downstream tasks.
Both of these problems are solved by adding embeddings containing the required information to our original token embeddings and using the sum as the input to the BERT model (a sketch of this sum follows the list). The following embeddings are added to the token embeddings:
- Segment Embeddings: They provide information about which sentence a particular token is a part of.
- Position Embeddings: They provide information about the order of words in the input.
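Roughly, the three embeddings are summed element-wise before entering the encoder stack. The sketch below shows the idea in PyTorch; it is simplified and leaves out the layer normalization and dropout that the real implementation applies on top of the sum:

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sum of token, segment and position embeddings (no LayerNorm/dropout for brevity)."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)      # sentence A (0) or sentence B (1)
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
```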
Outputs
How does one predict outputs for two different tasks simultaneously? The answer is to use separate FFNN + Softmax layers built on top of the outputs of the last encoder that correspond to the relevant input tokens. We will refer to the outputs of the last encoder as final states.
The first input token is always a special classification token, [CLS]. The final state corresponding to this token is used as the aggregate sequence representation for classification tasks, and in particular for Next Sentence Prediction, where it is fed into a FFNN + Softmax layer that predicts the probabilities of the labels "IsNext" and "NotNext".
The final states corresponding to the [MASK] tokens are fed into a FFNN + Softmax to predict the original words from our vocabulary.
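A stripped-down sketch of how these two heads could sit on top of the final states is shown below. The real BERT MLM head adds an extra transform layer and ties its weights to the input embedding matrix, and the masked positions differ per example, but the idea is the same:

```python
import torch.nn as nn

hidden, vocab_size = 768, 30522

# NSP head: operates on the final state of the [CLS] token.
nsp_head = nn.Linear(hidden, 2)                 # logits for "IsNext" / "NotNext"

# MLM head: operates on the final states of the masked positions.
mlm_head = nn.Linear(hidden, vocab_size)        # logits over the whole vocabulary

def pretraining_outputs(final_states, masked_positions):
    # final_states: (batch, seq_len, hidden) from the last encoder block
    # masked_positions: list of masked indices (assumed identical across the batch here)
    cls_state = final_states[:, 0]              # [CLS] is always the first token
    nsp_logits = nsp_head(cls_state)
    mlm_logits = mlm_head(final_states[:, masked_positions])
    return nsp_logits, mlm_logits
```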
Fine-tuning BERT
Fine-tuning on various downstream tasks is done by swapping out the appropriate inputs or outputs. In the general run of things, to train task-specific models, we add an extra output layer to the existing BERT model and fine-tune the resulting model end to end, updating all parameters. A positive consequence of only adding input/output layers and not changing the core BERT model is that only a minimal number of parameters need to be learned from scratch, making the procedure fast, cheap and resource-efficient.
Just to give you an idea of how fast and efficient it is, the authors claim that all the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.
In Sentence Pair Classification and Single Sentence Classification, the final state corresponding to the [CLS] token is used as the input for the additional layers that make the prediction.
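In practice you rarely wire this up by hand. As a hedged illustration, the Hugging Face `transformers` library exposes a ready-made class that adds exactly such a classification head on top of the [CLS] final state:

```python
# Minimal sketch of a classification fine-tuning setup with Hugging Face `transformers`.
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("A single sentence to classify.", return_tensors="pt")
outputs = model(**inputs)        # outputs.logits has shape (1, num_labels)
```

From here, fine-tuning is just ordinary supervised training of this model on labelled examples.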
In QA tasks, a start vector (S) and an end vector (E) are introduced during fine-tuning. The question is fed as sentence A and the passage containing the answer as sentence B. The probability of word i being the start of the answer span is computed as the dot product between Ti (the final state corresponding to the i-th input token) and S, followed by a softmax over all the words in the paragraph. A similar method is used for the end of the span.
The score of a candidate span from position i to position j is defined as S·Ti + E·Tj, and the maximum-scoring span with j ≥ i is used as the prediction.
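A small sketch of this span scoring, under the assumptions that `final_states` holds only the paragraph tokens and that we cap the answer length for practicality (the paper itself simply takes the maximum-scoring span with j ≥ i):

```python
import torch

def best_span(final_states, S, E, max_answer_len=30):
    """final_states: (seq_len, hidden) final states of the paragraph tokens;
    S, E: learnt start/end vectors of size (hidden,)."""
    start_scores = final_states @ S            # S · T_i for every position i
    end_scores = final_states @ E              # E · T_j for every position j
    best, best_score = (0, 0), float("-inf")
    for i in range(len(start_scores)):
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_scores[i] + end_scores[j]   # S·T_i + E·T_j, with j >= i
            if score > best_score:
                best, best_score = (i, j), score
    return best                                # (start index, end index) of the predicted span
```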
GPT — The distant cousin
Is BERT the only model producing these ground-breaking results? No. Another model by OpenAI, called GPT, has been making quite the buzz on the internet.
But what many people don't realize is that these two models have something in common: both reuse a Transformer component as their building block. As stated earlier, BERT stacks the encoder part of the Transformer, while GPT uses the decoder part of the Transformer as its building block.
Note that the bidirectional connections in BERT come from the encoder's bidirectional self-attention, while the connections in GPT flow in a single direction, from left to right, because the decoder's self-attention is masked to prevent positions from looking at future tokens (refer to the Transformers article for more info).
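The difference is easiest to see in the attention masks. This is an illustrative sketch, not code from either model; a 1 means the token in that row may attend to the token in that column:

```python
import torch

seq_len = 5

# BERT (encoder): every token may attend to every other token, in both directions.
bidirectional_mask = torch.ones(seq_len, seq_len)

# GPT (decoder): a token may only attend to itself and earlier tokens.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
```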
The BERT Family
It wouldn't be the 21st century if we didn't take something that works well and try to recreate or modify it. The BERT architecture is no different. These are some of its most popular variants:
- ALBERT by Google (and others) — This paper describes parameter-reduction techniques that lower the memory consumption and increase the training speed of BERT models.
- RoBERTa by Facebook — This paper from FAIR argues that the original BERT models were under-trained and shows that, with more training/tuning, they can outperform the initial results.
- ERNIE: Enhanced Representation through Knowledge Integration by Baidu — It is inspired by the masking strategy of BERT and learns language representations enhanced by knowledge masking strategies, which include entity-level masking and phrase-level masking.
- DistilBERT by Hugging Face — A smaller BERT trained using knowledge distillation.
You can check out more BERT inspired models at the GLUE Leaderboard.
Conclusion
- BERT is a stack of Transformer encoder blocks.
- It has two phases — pre-training and fine-tuning.
- Pre-training is computationally and time intensive.
- It is, however, independent of the task the model will finally perform, so the same pre-trained model can be used for a lot of tasks.
- GPT is not that different from BERT; it is a stack of Transformer decoder blocks.
- There are many variants of BERT out there.
References + Recommended Reads
- Transformers — if you want more in-depth knowledge of the aforementioned encoder/decoder architecture.
- The paper — it is easy to read, and they also elaborate on the practical details a bit. Worth a read.
- Jay Alammar’s Blog.
- Official GitHub repo.
- More BERT Models.

