BERT: A deeper dive

Moving ahead of Transformers

Mehul Gupta
Data Science in your pocket

--

After covering Transformers & attention mechanisms in my previous post, this time I am moving on to BERT, which is an evolved version of the Transformer & a breakthrough model in NLP.

Note: You might need to go back to my previous post for a better understanding

Starting with the most obvious question: what new does BERT bring?

  • It eliminates the Decoder (used in Transformers) completely. Hence, only the Encoder is used.
  • This Encoder is trained so as to achieve bidirectional training over the training dataset, unlike the Transformer, where training is only unidirectional (discussed below).
  • The end goal is a model that understands the ‘language model’ rather than a particular task, and which, using transfer learning, can then be used for many tasks.
  • The tokenizer used is WordPiece, which converts a text sequence into tokens, as explained in my previous post.

All the points are fine, but the 3rd one may have bounced off you. What does it mean?

It means giving the model a human-like understanding of a particular language. Any problem in that language, be it Question Answering, Sentence Completion, Sentiment Analysis, etc., can then be solved using just that one model with a few tweaks. This is unlike a Transformer, where for every problem the training has to be done again from scratch.

So let’s understand what changes have been made to the famous Transformer to make it a more generalized model for any task.

The Structure

It uses the same Encoder structure as in Transformers, but scaled up:

Multi-Heads = 12, which was 8 in the Transformer

Embedding dimension = 768, which was 512 in the Transformer

12 repeated blocks of (Multi-Head Attention, Normalization, FFN), which was 6 in the Transformer

This configuration is called BERT-Base.

An even bigger model, called BERT-Large, exists with

16 Multi-Heads

Embedding dimension = 1024

24 repeated blocks of (Multi-Head Attention, Normalization, FFN)

There exist many other configs for BERT with different numbers of Multi-Heads, embedding dimensions, or repeated blocks. The structure below can serve as a template for all BERT versions.
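To make these configurations concrete, here is a minimal sketch (my own illustration, not from the original post), assuming the Hugging Face transformers library, which exposes exactly these hyper-parameters through a BertConfig object:

```python
# A minimal sketch, assuming the Hugging Face `transformers` library is installed.
from transformers import BertConfig, BertModel

# BERT-Base: 12 repeated blocks, 12 Multi-Heads, 768-dim embeddings
base_config = BertConfig(
    hidden_size=768,         # embedding dimension
    num_hidden_layers=12,    # repeated (Attention, Normalization, FFN) blocks
    num_attention_heads=12,  # Multi-Heads per block
)

# BERT-Large: 24 repeated blocks, 16 Multi-Heads, 1024-dim embeddings
large_config = BertConfig(
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
)

model = BertModel(base_config)  # randomly initialised (not pre-trained) BERT core
print(model.config.num_hidden_layers)  # 12
```

Changing these three numbers is enough to move between the published BERT variants.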

R_vectors are the output embeddings for each input-sequence token (with attention incorporated). They are just a different representation of the Attention Matrix output in the Transformer’s Encoder.

Hence, in BERT, as mentioned, the core architecture remains the same as the Transformer’s, though it goes bigger: more Multi-Head Attention, a higher repeated-block count ‘N’ (Attention, Normalize, FFN, Normalize) & a larger embedding size for both the input token embeddings & the output embedding matrix. This BERT core remains common for all the tasks mentioned below.
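To see what these R_vectors look like in practice, here is a small sketch (mine, assuming the pre-trained bert-base-uncased checkpoint from Hugging Face): the BERT core returns one contextual embedding per input token.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Paris is a beautiful city. I love Paris", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One R_vector per token (including [CLS] & [SEP]), each of dimension 768
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)
```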

So we must jump to the most important concepts:

Pre-training & Fine-tuning.

As mentioned earlier, BERT aims to learn a ‘language model’ rather than any specific problem, for the sake of generalization. For any specific problem in a certain language, only minor tweaks are required.

Pre-training refers to the training where BERT learns the ‘language model’ of a particular language, while Fine-tuning is the training done for a specific task (say Q&A or classification or any XYZ task) using Transfer Learning. Hence, pre-training only has to be done once, and the same model can then be fine-tuned for different problems. The major savings are in hardware resources & training time, as the model doesn’t need to learn all features from scratch for every problem (it is already pre-trained) but only a few new ones according to the problem.

Let’s see how BERT learns the so-called ‘language model’!

Pretraining

If you remember, in Transformers we used to add a position vector alongside the learned token embeddings to improve them. In BERT, a few more alterations are made to the input embedding:

  • Any text sequence is broken into tokens using the WordPiece tokenizer, which I explained in one of my previous posts.
  • Addition of [CLS] & [SEP] tokens. A [CLS] token is added at the start of every input sequence, while a [SEP] token is added after the end of each sentence in the input sequence (and not just at the end of the whole sequence). A few examples may clear the air:
  1. [CLS] He is a good boy [SEP] He is a star footballer [SEP]
  2. [CLS] They have been together [SEP]
  • Segment Embedding. A segment embedding has the same length as the input sequence & helps differentiate the 2 sentences in one input sequence. So, for the 1st example in the above point, we can have a segment embedding like [a, a, a, a, b, b, b, b, b], which has the same value for all tokens of one sentence & a different value for tokens of the other sentence. If the input has just one sentence, the segment embedding is the same for all tokens, i.e. [a, a, a, a] for the 2nd example above. Here ‘a’ & ‘b’ can be any constants.
  • The final embedding is the sum of the token embedding (learned through the Input Embedding layer), the positional embedding & the segment embedding, as shown below.
Final embedding for the sample input ‘Paris is a beautiful city. I love Paris’. Observe the Segment Embedding being the same within each of the 2 sentences (& different across them). Also observe how the [CLS] & [SEP] tokens are inserted.

We will soon see why the extra tokens & the segment embedding are required.

So, now we have 3 embeddings with a few extra tokens per input sequence.
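As a quick illustration (my own, not code from the original post), the WordPiece tokenizer shipped with Hugging Face transformers produces exactly these pieces: the token ids, the segment ids (called token_type_ids, playing the role of the ‘a’/‘b’ constants above) & the inserted [CLS]/[SEP] tokens. The positional embeddings are added inside the model itself.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Two sentences passed as one input sequence, as in the 'Paris' example above
enc = tokenizer("Paris is a beautiful city.", "I love Paris")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'paris', 'is', 'a', 'beautiful', 'city', '.', '[SEP]', 'i', 'love', 'paris', '[SEP]']

# Segment ids: 0 for the 1st sentence (and its [SEP]), 1 for the 2nd sentence
print(enc["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```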

Pre-training BERT includes training the model on two problems

Masked Language Modeling (MLM)

Next Sentence Prediction (NSP)

It must be kept in mind that the above-mentioned alterations are common to both MLM & NSP. Let’s dive into MLM.

Masked Language Modeling (MLM)

The 1st pre-training task revolves around predicting ‘masked’ tokens in a given input sequence.

Preparing training data

  • Random tokens are masked & have to be predicted by the model, as shown in the examples below. Multiple tokens may get masked:

I am [MASK] boy

She loved [MASK] with chocolate [MASK]

She is at [MASK]. I just texted her.

In BERT, roughly 15% of the tokens in the training data are chosen for prediction. The above ‘masking’ (hiding the token from the input) is applied to 80% of these chosen tokens. For the remaining 20%, the alterations below are done in equal proportions:

  • Replace the chosen token with a random token from the vocabulary (10%)

He is ‘cake’ best friend,

‘Bus’ love each other

  • No alteration to the chosen token in the remaining 10%

He is my best friend,

They love each other

Hence, for a chosen token position, the model can see 3 types of input:

The token masked (80%)

A random token in its place (10%)

No change (10%)
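Here is a small illustrative sketch (my own, in plain Python) of this 80/10/10 corruption rule applied to the roughly 15% of tokens chosen for prediction; the tiny vocabulary & the helper name mask_tokens are made up for the example.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Apply the MLM corruption rule to a list of WordPiece tokens.

    Returns the corrupted tokens plus labels holding the original token
    at the chosen positions (None everywhere else).
    """
    corrupted, labels = [], []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]") or random.random() > select_prob:
            corrupted.append(tok)   # not chosen: leave untouched, nothing to predict
            labels.append(None)
            continue
        labels.append(tok)          # chosen: the model must predict the original token
        r = random.random()
        if r < 0.8:
            corrupted.append(MASK_TOKEN)            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted.append(random.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                   # 10%: keep the original token
    return corrupted, labels

# Tiny made-up example
vocab = ["i", "am", "a", "good", "boy", "cake", "bus"]
print(mask_tokens(["[CLS]", "i", "am", "a", "good", "boy", "[SEP]"], vocab))
```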

What is the significance of Masked Language Modeling?

We train BERT with such masked input to achieve bidirectional training of the model, unlike Transformers, where we are always predicting the next token of the sequence. The biggest problem with the Transformer approach is that the model is trained in a unidirectional way, i.e. either left-to-right or right-to-left, because the tokens after the token to be predicted are always masked in the decoder (refer to my previous post). This isn’t the case with BERT: since a random token is masked & has to be predicted, the rest of the tokens are available, & hence training is bidirectional, i.e. the model can take context from both sides together when predicting this token. For example:

If we have a sentence

I am a boy. I like playing Cricket, & we need to predict, say, the 4th token, i.e. ‘boy’. Then the input sequence visible to each model is:

Transformer: I am a ___

BERT: I am a ___. I like playing Cricket

Hence, tokens to the right of the missing token that has to be predicted are also available to the model, enabling a better prediction.

Why do we need 3 different types of input (as mentioned above: masked, random token & no change)?

As mentioned earlier, pre-training is meant to learn the ‘language model’. But when we need to use this model for specific tasks like sentiment analysis or summarization, i.e. the fine-tuning tasks where tweaks are required, this masking can be troublesome: if we pre-train on just [MASK] tokens, the model will overfit to them & may perform poorly when we don’t give it a masked token (as in sentiment analysis). Hence, the 3 types of input are used for training, keeping fine-tuning in mind.

MLM model Structure

Do keep in mind the BERT core discussed above!

For MLM training, a few more layers are added on top of this Encoder structure: a Feed-Forward Network taking the output embedding of the [MASK] token only as input & a SoftMax activation predicting the most appropriate token as the one with the highest predicted probability.

Example for MLM model training. Here, the aim is to detect the [MASK] token using other visible tokens

Let's run through the above example:

1. “Paris is a beautiful city. I love Paris” is taken as input

2. The input text sequence is tokenized using WordPiece

3. ‘city’ is masked & [CLS] & [SEP] tokens are added.

4. The 3 embeddings (Token, Positional & Segment) are generated & added together

5. This merged embedding is fed to BERT-Base (i.e. the Encoder with 12 repeated blocks, the BERT core structure discussed above)

6. An attention vector (R_vector) is generated for each token

7. The embedding R_mask (the attention vector for the masked token) is fed to the feed-forward network & the masked token is predicted (observe that the probability for ‘city’ = 0.9, the highest amongst all other tokens)
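A minimal sketch of this exact flow (my own, using the pre-trained MLM head that ships with bert-base-uncased in Hugging Face transformers, not code from the original post):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Steps 1-4: tokenize, mask 'city', add [CLS]/[SEP]; the 3 embeddings are built inside the model
inputs = tokenizer("Paris is a beautiful [MASK]. I love Paris", return_tensors="pt")

# Steps 5-7: run the BERT core + the FFN/SoftMax head over the vocabulary
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # expected to print something like 'city'
```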

And here, we end MLM

Next Sentence Prediction

The other task considered in pre-training is NSP, where we train a binary classifier using BERT, with the aim of knowing whether two merged sentences are related or not.

Say we have 3 sentences

He is handsome

The sun rises in the east

His name is ‘Raman’

Now, if we form 2 input sequences using the above sentences say:

He is handsome. His name is ‘Raman’

The sun rises in the east. His name is ‘Raman’

In NSP, the target is to predict whether the 2nd sentence in the input sequence has any association with the 1st. Hence it should predict ‘Yes’ for the 1st input sequence but ‘No’ for the 2nd input.

We are now clear about the second problem on which BERT has to be trained (in the original BERT, MLM & NSP are in fact trained together, with their losses combined, so the model doesn’t lose the information learned from either task & doesn’t start fresh). Next, we must see how the training data for this problem is prepared.

Preparing training dataset

This isn’t tough. What we can do is take up 2 random documents, say (just examples!):

The thirsty crow

An essay on ‘My Best Friend’

Now, in equal proportions:

Join two consecutive sentences from the same document & mark the pair ‘Yes’

Take two random sentences, one from each of the 2 documents, & mark the pair ‘No’

The same embedding changes & extra tokens as in MLM training are applied here as well before feeding the training data to BERT, i.e.

  • Token + Positional + Segment embeddings
  • [CLS] & [SEP] tokens
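A tiny sketch (my own, with made-up documents & the hypothetical helper make_nsp_pairs) of this pairing rule for building the NSP training set:

```python
import random

def make_nsp_pairs(doc_a, doc_b):
    """Build (sentence_a, sentence_b, is_next) examples from two documents:
    half positives (consecutive sentences) & half negatives (cross-document pairs)."""
    pairs = []
    # 'Yes' (is_next = 1): two consecutive sentences from the same document
    for doc in (doc_a, doc_b):
        for s1, s2 in zip(doc, doc[1:]):
            pairs.append((s1, s2, 1))
    # 'No' (is_next = 0): random sentences, one from each document
    for _ in range(len(pairs)):
        pairs.append((random.choice(doc_a), random.choice(doc_b), 0))
    return pairs

crow = ["A thirsty crow found a pot.", "It dropped pebbles into the pot.", "The water rose & it drank."]
essay = ["My best friend is Raman.", "He is handsome.", "We play cricket together."]
for a, b, label in make_nsp_pairs(crow, essay)[:4]:
    print(label, "|", a, "->", b)
```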

NSP Model Structure

NSP training with ‘She cooked pasta. It was delicious’ as input. Do observe the isNext probability = 0.9 for this input.

As in MLM, in NSP too the BERT core remains the same as discussed earlier, but an FFN with a SoftMax is applied on top of it, taking the output embedding of the [CLS] token & predicting whether the 2nd sentence in the input sequence is associated with the 1st, using the ‘isNext’ label. The figure is self-explanatory.

Also, the addition of the Segment Embedding & the extra tokens now makes sense, as they play a crucial role in NSP training:

  • The [CLS] token helps in the NSP binary classification. It is assumed that [CLS]’s output embedding incorporates a numeric representation of the entire input sequence & hence it is a very useful token for tasks like text classification.
  • [SEP] marks where a new sentence starts within the input sequence
  • Segment embeddings help distinguish the tokens of the multiple sentences in the input sequence
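For completeness, a minimal sketch (mine, assuming the pre-trained NSP head that ships with bert-base-uncased in Hugging Face transformers) of the same idea:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Sentence pair fed as (sentence A, sentence B); [CLS]/[SEP] & segment ids are added automatically
inputs = tokenizer("She cooked pasta.", "It was delicious.", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2): scores for [isNext, notNext]

probs = torch.softmax(logits, dim=-1)
print(probs[0, 0].item())  # probability that sentence B follows sentence A
```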

So, this concludes our pre-training tasks, & the BERT core should now understand the language it was trained on. For any specific task we want to perform using BERT, we need to fine-tune: add a few layers over BERT (as we did in MLM & NSP) & train.

Before ending, let me briefly cover Fine-Tuning.

Fine Tuning

Fine-tuning BERT for a task looks very similar to training BERT for the above pre-training tasks, MLM & NSP. So, if we wish our BERT to perform Sentiment Classification, what we need to do is

  • Add FFN over BERT (same as in MLM & NSP training)
  • Feed CLS token output embedding to FFN

Just a question: is fine-tuning training the same as pre-training (MLM & NSP training)?

There exists a stark difference:

While pre-training (MLM & NSP), the weights of the BERT core are trained alongside the FFN weights & aren’t frozen. While fine-tuning, we can instead freeze the BERT core & train only the FFN (or whatever additional layers are added over BERT). Note, though, that the original BERT paper fine-tunes all weights end-to-end; freezing the pre-trained core is a cheaper, feature-based alternative.

The below figure shows how fine-tuning is done for Sentiment Analysis using pre-trained BERT, with the pre-trained BERT weights kept frozen.
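Here is a hedged sketch (my own, not the post’s code) of this sentiment-classification setup using Hugging Face transformers; the loop freezing the BERT core corresponds to the frozen-weights variant described above & can be dropped for full end-to-end fine-tuning.

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Adds a classification head (FFN + SoftMax over 2 labels) on top of the [CLS] representation
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Optional: freeze the pre-trained BERT core so only the new head gets trained
for param in model.bert.parameters():
    param.requires_grad = False

inputs = tokenizer("I really loved this movie!", return_tensors="pt")
logits = model(**inputs).logits  # raw scores for the 2 sentiment classes
print(logits.shape)              # torch.Size([1, 2])
```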

Till then,
