Understanding the BERT Model


Bert is one of the most popular state-of-the-art text embedding models and has revolutionized the world of NLP. In this blog we will start with what the Bert model is and how it differs from other embedding models. We will then look at the working of Bert and its configurations in detail.

Topics To Be Covered:

  • Basic idea of Bert
  • Working of Bert
  • Configuration of Bert
  • Pre-training the Bert model
  • Pre-training procedure
  • Subword tokenization algorithms

Basic Idea Of Bert

Bert stands for Bidirectional Encoder Representations from Transformers. It has created a major breakthrough in the field of NLP by providing better results on many NLP tasks, such as question answering, text generation, sentence classification and many more. One of the major reasons for its success is that it is a context-based embedding model, unlike popular embedding models such as word2vec, which are context-free.

First, let's understand the difference between a context-free and a context-based model. Consider the following sentences:

Sentence A: He got bit by a Python.

Sentence B: Python is my favorite programming language.

By reading both sentences we can understand that the meaning of the word “Python” is different in the two sentences. In sentence A the word “Python” refers to a snake, while in sentence B it refers to a programming language.

Now, if we get the embedding of the word “Python” using an embedding model like word2vec, we will get the same embedding in both sentences, which blurs the meaning of the word. This is because word2vec is a context-free model; it ignores the context and gives the same embedding for the word “Python” irrespective of the context.

Bert, on the other hand, is a context-based model. It understands the context and then generates the embedding for the word based on that context. So, for the preceding two sentences, it gives different embeddings for the word “Python”.

But how does this work? How does Bert understand the context?

Let's take sentence A. In this case, Bert relates each word in the sentence to all the other words in the sentence to get the contextual meaning of every word. By doing this, Bert can understand that the word “Python” denotes a snake. Similarly, in sentence B, Bert understands that the word “Python” denotes a programming language.
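As a quick illustration, here is a minimal sketch that compares the contextual embeddings Bert produces for the word “Python” in the two sentences. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is required by the rest of this post; it also assumes that “python” is a single token in that vocabulary.

```python
# Minimal sketch: contextual embeddings of "python" in two different sentences.
# Assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def python_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # [seq_len, 768]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("python")]                        # assumes "python" is one token

emb_a = python_embedding("He got bit by a python.")
emb_b = python_embedding("Python is my favorite programming language.")

# Unlike word2vec, the two embeddings differ because the contexts differ.
print(torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0))
```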

Now the question is, how exactly does Bert work? How does it understand the context?

Working Of Bert

Bert, as its name suggests, is based on the transformer model. For a brief overview of what transformers are, please refer to my previous blog about the transformer and its working.

In a transformer, we feed the sentence as an input to the transformer's encoder and it returns the representation of each word in the sentence as output. That is exactly what Bert is: the encoder representation of the transformer, and it is bidirectional because the encoder of the transformer is bidirectional.

Once we feed the sentence as an input to the encoder, the encoder understands the context of the sentence using the multi-head attention mechanism.

Configuration of Bert

The researchers presented Bert in two main configurations:

  • Bert-base
  • Bert-large

Bert-base has 12 encoder layers stacked one on top of the other, 12 attention heads and 768 hidden units. The total number of parameters in Bert-base is about 110 million.

Bert-large has 24 encoder layers stacked one on top of the other, 16 attention heads and 1024 hidden units. The total number of parameters in Bert-large is about 340 million.

There are other configurations of Bert apart from the two standard ones, such as Bert-mini, Bert-tiny, Bert-medium, etc.

We can use the smaller configurations of Bert in settings where computational resources are limited. However, the standard configurations give more accurate results and are the most widely used.
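For reference, here is a minimal sketch of how these two configurations map to hyperparameters. It assumes the Hugging Face transformers library, which this post does not otherwise rely on; the model is randomly initialized and used only to count parameters.

```python
# Minimal sketch: the two standard Bert configurations expressed as hyperparameters.
# Assumes the Hugging Face `transformers` library; the model is randomly initialized.
from transformers import BertConfig, BertModel

base_config  = BertConfig(num_hidden_layers=12, num_attention_heads=12,
                          hidden_size=768,  intermediate_size=3072)
large_config = BertConfig(num_hidden_layers=24, num_attention_heads=16,
                          hidden_size=1024, intermediate_size=4096)

base = BertModel(base_config)
print(sum(p.numel() for p in base.parameters()))   # roughly 110 million parameters
```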

Pre-training the Bert model

Pre-training a model means training it on a huge dataset for a particular task and saving the trained model. Now, for a new task, instead of initializing a new model with random weights, we initialize it with the weights of our already trained, i.e. pre-trained, model. Since the model has already been trained on a huge dataset, instead of training a model from scratch for the new task, we use the pre-trained model and adjust (fine-tune) its weights according to the new task. This is a type of transfer learning.

The Bert model is pre-trained on a huge corpus using two interesting tasks called masked language modeling and next sentence prediction. For a new task, let's say question answering, we use the pre-trained Bert and fine-tune its weights.

Input data representation

Before feeding the input to Bert, we convert the input into embeddings using three embedding layers:

  • Token embedding
  • Segment embedding
  • Position embedding

Token embedding

Let’s understand this by taking an example. Consider the following two sentences:

Sentence A: Paris is a beautiful city.

Sentence B: I love Paris.

First, we tokenize both sentences and our output will be as follows:

tokens = [Paris, is, a, beautiful, city, I, love, Paris]

Then we add a new token called [CLS] at the beginning of the token list:

tokens = [[CLS], Paris, is, a, beautiful, city, I, love, Paris]

Then we add a [SEP] token at the end of every sentence:

tokens = [[CLS], Paris, is, a, beautiful, city, [SEP], I, love, Paris, [SEP]]

The [CLS] token is used for classification tasks, whereas [SEP] is used to indicate the end of every sentence. Now, before feeding the tokens to Bert, we convert them into embeddings using an embedding layer called the token embedding layer. Note that the values of the embeddings are learned during training.
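For comparison, here is a minimal sketch of what an actual Bert tokenizer produces for these two sentences. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (not required by this post); note that the real tokenizer lowercases the text and keeps punctuation, unlike the simplified lists above.

```python
# Minimal sketch: [CLS]/[SEP] insertion and segment ids from a real Bert tokenizer.
# Assumes the Hugging Face `transformers` library and `bert-base-uncased`.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Paris is a beautiful city.", "I love Paris.")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'paris', 'is', 'a', 'beautiful', 'city', '.', '[SEP]', 'i', 'love', 'paris', '.', '[SEP]']
print(encoding["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens
```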

Segment embedding

Segment embedding is used to distinguish between the two given sentences.

Let's consider our previous example again:

tokens = [[CLS], Paris, is, a, beautiful, city, [SEP], I, love, Paris, [SEP]]

Now, apart from [SEP], we have to give our model some sort of indicator to distinguish between the two sentences. To do this, we feed the input tokens to the segment embedding layer.

The segment embedding layer returns only one of two embeddings, EA (embedding of Sentence A) or EB (embedding of Sentence B): if the input token belongs to sentence A it returns EA, and if it belongs to sentence B it returns EB.

Position Embedding

Since we know that the transformer does not use any recurrence mechanism and processes all the words in parallel, we need to provide some information related to word order, so we use position embeddings.

We know that Bert is essentially the transformer's encoder, so we need to give it information about the position of the words in our sentence before feeding the input to Bert.

Final Representation

Now let's look at the final representation of the input data. The final input embedding is the sum of the token, segment and position embeddings, as sketched below.
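Here is a minimal PyTorch sketch of the three embedding layers and how they are summed; the sizes match Bert-base, and the token ids are dummy values used only for illustration.

```python
# Minimal sketch: final input representation = token + segment + position embeddings.
# Sizes match Bert-base; the ids below are dummy values, not a real vocabulary mapping.
import torch
import torch.nn as nn

vocab_size, hidden, max_len = 30522, 768, 512
token_emb    = nn.Embedding(vocab_size, hidden)   # one vector per vocabulary token
segment_emb  = nn.Embedding(2, hidden)            # EA / EB: sentence A vs sentence B
position_emb = nn.Embedding(max_len, hidden)      # learned position vectors

input_ids   = torch.tensor([[101, 5, 6, 7, 8, 9, 102, 10, 11, 5, 102]])   # dummy ids
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
positions   = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(embeddings.shape)   # torch.Size([1, 11, 768])
```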

WordPiece Tokenizer

Bert uses a special type of tokenizer called the WordPiece tokenizer. The WordPiece tokenizer follows the subword tokenization scheme. To understand the WordPiece tokenizer, consider the sentence

“Let us start pretraining the model”

Now, if we tokenize the sentence using WordPiece, we obtain

tokens = [let, us, start, pre, ##train, ##ing, the, model]

While tokenizing the sentence, our word pretraining is split into three parts. This happens because the WordPiece tokenizer first checks whether the word is present in our vocabulary. If the word is present, then it is used as a token; if not, the word is split into subwords recursively until the subwords are found in the vocabulary. This process is effective in handling out-of-vocabulary words.
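As a quick check, here is a minimal sketch that runs a real WordPiece tokenizer on the same sentence. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; the exact split depends on that checkpoint's vocabulary.

```python
# Minimal sketch: WordPiece splitting an out-of-vocabulary word into subwords.
# Assumes the Hugging Face `transformers` library and `bert-base-uncased`.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Let us start pretraining the model"))
# e.g. ['let', 'us', 'start', 'pre', '##train', '##ing', 'the', 'model']
# (the exact split depends on the checkpoint's vocabulary)
```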

Pre-Training Strategies

The Bert model is pre-trained on the following two tasks:

  1. Masked language modeling
  2. Next Sentence Prediction

Before diving directly into these two tasks, let's first understand language modeling.

Language Modeling

In the language modeling task, we train the model to predict the next word given a sequence of words. We can categorize language modeling into two approaches:

  1. Auto-regressive language modeling
  2. Auto-encoding language modeling

Auto-regressive language modeling

We can categorize auto-regressive language modeling as follows:

  • forward (left to right) prediction
  • backward (right to left) prediction

Now consider our previous example, “Paris is a beautiful city. I love Paris”. Let's remove the word city and add a blank. Now, our model has to predict the blank. If we use forward prediction, then our model reads all the words from left to right up to the blank in order to make the prediction:

Paris is a beautiful __.

But if we use backward prediction, then our model reads all the words from right to left up to the blank in order to make the prediction:

__. I love Paris.

Thus, auto-regressive models are unidirectional, meaning they read the sentence in only one direction.

Auto-encoding language modeling

Auto-encoding language modeling takes advantage of both forward and backward prediction, so we can say that auto-encoding models are bidirectional in nature. Reading the sentence in both directions gives much more clarity about the sentence and hence gives better results. Bert is an auto-encoding language model.

Masked Language Modeling

In the masked language modeling task, for a given input we randomly mask 15% of the words and train the network to predict the masked words. To predict the masked words, our model reads the sentence in both directions.

Let’s understand how masked language modeling works. Consider our previous example:

tokens = [[CLS], Paris, is, a, beautiful, [MASK], [SEP], I, love, Paris, [SEP]]

Here, we have replaced the word city with the [MASK] token.

Masking tokens in this way creates a discrepancy between pre-training and fine-tuning: we train Bert by predicting the [MASK] token, and after training we fine-tune the pre-trained Bert for downstream tasks such as sentiment analysis. But during fine-tuning we will not have any [MASK] tokens in the input, which causes a mismatch between the way Bert is pre-trained and the way it is used for fine-tuning.

To overcome this issue, we apply the 80-10-10% rule (a small sketch follows the list below). We learned that we randomly mask 15% of the tokens; now, for these 15% of tokens, we do the following:

  • For 80% of the time, we replace the token with the [MASK] token.
  • For 10% of the time, we replace the token with a random token, so our input may look as follows:

tokens = [[CLS], Paris, is, a, beautiful, love, [SEP], I, love, Paris, [SEP]]

  • For 10% of the time, we don't make any changes.
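Here is a minimal sketch of this masking rule, written as a hypothetical helper function for illustration only; the real Bert pre-training code is more involved.

```python
# Minimal sketch of the 80-10-10% masking rule (illustrative, not the actual Bert code).
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    labels = list(tokens)                  # the original tokens are the prediction targets
    masked = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() > mask_rate:
            continue                       # only ~15% of ordinary tokens are selected
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the [MASK] token
        elif r < 0.9:
            masked[i] = random.choice(vocab)   # 10%: replace with a random token
        # else: 10% of the time the token is left unchanged
    return masked, labels

tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]",
          "I", "love", "Paris", "[SEP]"]
print(mask_tokens(tokens, vocab=["love", "game", "radio"]))
```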

Following the tokenization and masking, we feed the input tokens to the token, segment and position embedding layers and get the input embeddings.

Now we feed our input embeddings to Bert. Bert takes the input and returns a representation of each token as output.

To predict the masked token, we feed the representation of the masked token, R[MASK], returned by Bert into a feedforward network with the softmax activation function. The feedforward network takes R[MASK] as input and returns, for every word in our vocabulary, the probability of it being the masked word.
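A minimal sketch of this prediction step, using the ready-made masked language modeling head in the Hugging Face transformers library (an assumption of this example, not something the post depends on):

```python
# Minimal sketch: predicting a masked token with a pre-trained Bert MLM head.
# Assumes the Hugging Face `transformers` library and `bert-base-uncased`.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Paris is a beautiful [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))   # candidate words and their probabilities
```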

The masked language modeling task is also known as a cloze task. While masking input tokens, we can also use a slightly different method known as whole word masking.

Whole Word Masking

Consider the sentence “Let us start pretraining the model”. After applying the WordPiece tokenizer, we get

tokens = [let, us, start, pre, ##train, ##ing, the, model]

Next, we add the [CLS] token and mask 15% of the words:

tokens = [[CLS], [MASK], us, start, pre, [MASK], ##ing, the, model]

As we can see, we have masked the subword ##train, which is part of the word pretraining. In whole word masking, if a subword is masked, then we mask all the subwords that make up the corresponding word, while keeping our overall masking rate at 15% (a small grouping sketch follows the example below).

tokens = [[CLS], let, us, start, [MASK], [MASK], [MASK], the, model]
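Here is a minimal sketch of how subwords can be grouped back into whole words for this kind of masking; the helper function is hypothetical and shown for illustration only.

```python
# Minimal sketch: grouping WordPiece subwords into whole-word spans for masking.
def whole_word_groups(tokens):
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)      # a ## subword continues the previous word
        else:
            groups.append([i])        # a new whole word starts here
    return groups

print(whole_word_groups(["let", "us", "start", "pre", "##train", "##ing", "the", "model"]))
# [[0], [1], [2], [3, 4, 5], [6], [7]] -> masking group [3, 4, 5] masks "pretraining"
```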

Next Sentence Prediction (NSP)

NSP is a binary classification task in which we feed two sentences to Bert and it has to predict whether the second sentence is the follow-up of the first sentence or not. By performing the NSP task, our model can understand the relation between two sentences. Understanding this relation is useful for downstream tasks such as question answering and text generation.

To perform the classification, we simply take the representation of the [CLS] token and feed it to a feedforward network with the softmax function, which returns the probability of the sentence pair being isNext or notNext. The embedding of [CLS] basically holds the aggregate representation of all the tokens.
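A minimal sketch of this classification using the pre-trained NSP head available in the Hugging Face transformers library (again an assumption of the example, not of the post):

```python
# Minimal sketch: next sentence prediction with a pre-trained Bert NSP head.
# Assumes the Hugging Face `transformers` library and `bert-base-uncased`.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

inputs = tokenizer("We enjoyed the game.", "Turned the radio.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape [1, 2]
print(torch.softmax(logits, dim=-1))         # index 0 ~ isNext, index 1 ~ notNext
```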

Pre-Training Procedure

Bert is pre-trained using the Toronto BookCorpus and English Wikipedia datasets. We know that Bert is pre-trained using the masked language modeling and NSP tasks. Now, how do we prepare the dataset to train Bert on these two tasks?

Let's consider two sentences:

Sentence A: We enjoyed the game.

Sentence B: Turned the radio.

Note: the total number of tokens from the two sentences must be less than or equal to 512. While sampling sentence pairs, 50% of the time we take Sentence B as a follow-up of Sentence A, and 50% of the time as not a follow-up (a small sampling sketch follows below).
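Here is a minimal sketch of that 50/50 sampling, written as a hypothetical helper for illustration only:

```python
# Minimal sketch: building a sentence pair for the NSP task (50% isNext, 50% notNext).
import random

def make_nsp_pair(doc_sentences, all_sentences):
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        return sent_a, doc_sentences[i + 1], "isNext"       # real follow-up sentence
    return sent_a, random.choice(all_sentences), "notNext"  # random sentence from the corpus

doc = ["We enjoyed the game.", "It went into overtime."]
corpus = doc + ["Turned the radio.", "Paris is a beautiful city."]
print(make_nsp_pair(doc, corpus))
```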

After tokenizing the sentences, adding the [CLS] token at the beginning and the [SEP] token at the end of every sentence, we get

tokens = [[CLS], we, enjoyed, the, game, [SEP], turned, the, radio, [SEP]]

Next, we randomly mask 15% of the tokens according to the 80-10-10% rule:

tokens = [[CLS], we, enjoyed, the, [MASK], [SEP], turned, the, radio, [SEP]]

Now we train Bert on both the masked language modeling and NSP tasks together.

Subword tokenization algorithms

Byte Pair Encoding (BPE)

It involves the following steps (a small sketch follows the list):

  1. Extract the words from the given dataset along with their counts.
  2. Define the vocabulary size.
  3. Split the words into character sequences.
  4. Add all the unique characters from our character sequences to the vocabulary.
  5. Select and merge the symbol pair that has the highest frequency.
  6. Repeat step 5 until the vocabulary size is reached.
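Here is a minimal, self-contained sketch of these steps; real BPE implementations (and the toy word counts used below) are more elaborate, so treat this as an illustration of the merge loop only.

```python
# Minimal sketch of the BPE merge loop described above (illustrative only).
from collections import Counter

def bpe_vocab(word_counts, vocab_size):
    # Step 3: split each word into a character sequence ("</w>" marks the word end).
    words = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    # Step 4: start the vocabulary with all unique characters.
    vocab = {ch for w in words for ch in w}
    while len(vocab) < vocab_size:
        # Step 5: count adjacent symbol pairs and pick the most frequent one.
        pairs = Counter()
        for w, c in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merged = "".join(best)
        vocab.add(merged)
        # Merge the chosen pair everywhere it occurs.
        new_words = {}
        for w, c in words.items():
            out, i = [], 0
            while i < len(w):
                if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words[tuple(out)] = c
        words = new_words          # Step 6: repeat until the vocabulary size is reached
    return vocab

# Steps 1 and 2: toy word counts and a small target vocabulary size.
print(bpe_vocab({"cost": 2, "best": 2, "cast": 3}, vocab_size=12))
```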

Byte-level byte pair encoding (BBPE)

It is similar to BPE, but instead of splitting words into character sequences, we split them into sequences of bytes. It is very effective in handling OOV words and is great at sharing a vocabulary across multiple languages.

WordPiece

WordPiece works like BPE, with one difference: here we don't merge symbol pairs based on frequency. Instead, we merge the symbol pair that maximizes the likelihood of the training data.

Thank you.
