A Light Introduction to BERT

Constanza Fierro
Published in DAIR.AI
May 11, 2020

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (link)

Why is it important?

  1. BERT is a language model that can be used directly to approach other NLP tasks (summarization, question answering, etc.).
  2. It proved to be consistently better than the other language models proposed at the time, and it has since been shown to perform well on many other tasks (cited 4,832 times as of today).
  3. It has been shown that it’s capable of learning actual linguistic notions (Clark et al., 2019, summary here).

It achieves this with just a couple of changes to the transformer architecture together with new tasks for pre-training.

What does it propose?

This paper proposed a new way of thinking about language models, presenting two new pre-training tasks. BERT’s token representations also outperform others because they are learned using both the left and right context of the sentence.

How does it work?

BERT is basically the Transformer architecture (Vaswani et al., 2017, summary here) trained to learn language representations, and conceived to be used as the main architecture for NLP tasks. It differs from preceding language models mainly in that its learned representations contain context from both sides of the sentence (to the left and right of the word itself). Let’s first look at the training proposed to learn these language representations, and then at how they are used directly for other NLP tasks.

BERT training

BERT is trained using two objectives:

  1. Some tokens from the input sequence are masked and the model learns to predict these words (masked language model, MLM).
  2. Two “sentences” are fed as input and the model is trained to predict whether one sentence follows the other or not (next sentence prediction, NSP).

So we feed BERT two sentences with some tokens masked, and we obtain both a prediction of whether they are consecutive or not and the sentences with the masked words filled in, as Figure 1 shows.

Figure 1: Example of input and output for two masked sentences.

Masked language model

In each input sequence, 15% of the tokens are selected and processed as follows:

  • with 0.8 probability the token is replaced by [MASK]
  • with 0.1 probability the token is replaced by another random token
  • with 0.1 probability the token is unchanged

The motivation for this setup is that if we always replaced the selected tokens with [MASK], the model would learn just to handle masked positions rather than actual word representations; besides, downstream tasks won’t contain any [MASK] tokens, so always masking would create a mismatch between pre-training and fine-tuning.
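As a rough sketch of this corruption step (a toy implementation assuming a plain list of tokens and a vocabulary list to sample random replacements from; this is not the authors’ code):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Corrupt a token sequence for the masked-LM objective (toy sketch)."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < select_prob:       # 15% of tokens are selected for prediction
            targets.append(tok)                 # the model must recover the original token
            r = random.random()
            if r < 0.8:                         # 80% of the selected: replace with [MASK]
                inputs.append(MASK)
            elif r < 0.9:                       # 10%: replace with a random token
                inputs.append(random.choice(vocab))
            else:                               # 10%: keep the token unchanged
                inputs.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)                # ignored by the MLM loss
    return inputs, targets
```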

Next sentence prediction

The input is composed of two “sentences”, which are not actual linguistic sentences but spans of contiguous text. These two sentences A and B are separated by the special token [SEP] and are chosen such that 50% of the time B is the actual next sentence and 50% of the time it is a random sentence from the corpus.
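A minimal sketch of how such pairs could be built (assuming docs is a list of documents, each a list of sentence strings; the function name is mine):

```python
import random

def make_nsp_example(docs):
    """Return (sentence_a, sentence_b, is_next) with a 50/50 positive/negative split."""
    doc = random.choice([d for d in docs if len(d) > 1])
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:                   # 50%: B really follows A
        return sentence_a, doc[i + 1], 1
    other = random.choice(docs)                 # 50%: B is a random sentence
    return sentence_a, random.choice(other), 0
```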

BERT input

As Figure 2 shows, the input sequence of BERT is composed of two sentences with a [SEP] token in between, plus an initial “classification token” [CLS] that will later be used for prediction. Each token has a corresponding token embedding, a segment embedding that identifies which sentence it belongs to, and a position embedding that encodes its position (playing the same role as the positional encoding in the Transformer paper, although BERT learns these embeddings). All three embeddings are then summed up for each token.

Figure 2: Input construction.
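A minimal PyTorch sketch of this sum of embeddings (the hidden size and maximum length follow BERT-base, but the class and names are illustrative; the real model also applies layer normalization and dropout afterwards, omitted here):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Token + segment + position embeddings, summed per token (sketch)."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)      # sentence A -> 0, sentence B -> 1
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, input_ids, segment_ids):
        # input_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return self.token(input_ids) + self.segment(segment_ids) + self.position(positions)
```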

The datasets used were: BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).

Token construction

The tokens are defined using the WordPiece model (Schuster et al., 2012). This is similar to BPE (summary here), but instead of adding the pair of tokens that is most frequent in the corpus, we add the pair that most increases the likelihood of a language model over the vocabulary. More formally, the likelihood of the LM is the product of the likelihoods of each token in the corpus:

P(t₁)·P(t₂)·…·P(tₙ)

So the pair that increases this likelihood the most is the tᵢtⱼ such that:

tᵢtⱼ = argmax [ log P(tᵢⱼ) − (log P(tᵢ) + log P(tⱼ)) ]

where tᵢⱼ is the merged token and LM* is the language model whose vocabulary includes that merged pair.
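As a rough illustration of the scoring idea only (this ignores how WordPiece restricts merges to within words and re-estimates probabilities after each merge; the function name is mine):

```python
import math
from collections import Counter

def best_pair(corpus_tokens):
    """Pick the adjacent pair whose merge most increases the unigram LM log-likelihood."""
    unigrams = Counter(corpus_tokens)
    pairs = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    total = len(corpus_tokens)

    def gain(pair):
        ti, tj = pair
        # log P(t_ij) - (log P(t_i) + log P(t_j)), with counts as probability estimates
        return (math.log(pairs[pair] / total)
                - math.log(unigrams[ti] / total)
                - math.log(unigrams[tj] / total))

    return max(pairs, key=gain)
```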

BERT output

The final layer of BERT contains a representation Tᵢ for each token and the classifier embedding C; each Tᵢ is then used to predict the original token at the masked positions, and the C representation is used to predict whether the two sentences were contiguous or not.

Figure 3. The BERT architecture.
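A simplified sketch of these two prediction heads (the real implementation adds an extra transform on top of each Tᵢ and ties the MLM weights to the input embeddings; this version keeps only the essentials):

```python
import torch.nn as nn

class PreTrainingHeads(nn.Module):
    """MLM head over each token representation T_i and NSP head over C (sketch)."""
    def __init__(self, hidden=768, vocab_size=30522):
        super().__init__()
        self.mlm = nn.Linear(hidden, vocab_size)  # predicts the original token at each position
        self.nsp = nn.Linear(hidden, 2)           # predicts is-next vs. not-next from C

    def forward(self, sequence_output, cls_output):
        # sequence_output: (batch, seq_len, hidden) -> the T_i; cls_output: (batch, hidden) -> C
        return self.mlm(sequence_output), self.nsp(cls_output)
```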

BERT in some NLP tasks

BERT was conceived to be used directly for other NLP tasks, that is, not as input to another task-specific architecture, but as the architecture of the task itself. In general, we feed the task input into BERT, add a layer at the end to convert the prediction into the task-specific answer, and then fine-tune all the parameters end-to-end. This fine-tuning is relatively inexpensive (about an hour on a TPU, a few hours on a GPU).
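This recipe is what libraries such as Hugging Face transformers package up; a minimal usage sketch for a binary classification task (model name and example labels are just an illustration):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["a delightful movie", "a total waste of time"],
                  padding=True, return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([1, 0]))
outputs.loss.backward()   # gradients flow through the whole model: fine-tuning is end-to-end
```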

In general, the input of an NLP task is either a text pair (e.g. Q&A), which can be fed directly as we saw before, or a single text (e.g. text classification), in which case we use the pair text-∅ as input. At the output, we feed the last-layer representations into an output layer; we either:

1. Feed the token representations Tᵢ

Into an output layer for token-level tasks, such as sequence tagging or question answering.

This output layer can differ from task to task, but in general it is a computation involving one or more added vectors and each Tᵢ. For example, for a Q&A task where the input is a question and a passage containing the answer, the output layer can be composed of two vectors S and E (start and end embeddings), which we dot with each token representation to obtain the score of a candidate answer span from position i to j as S⋅Tᵢ + E⋅Tⱼ. The training objective is then the sum of the log-likelihoods of the correct start and end positions.
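A rough PyTorch sketch of such a span head (here the two rows of a single linear layer play the role of S and E; this is my simplification, not the paper’s exact code):

```python
import torch.nn as nn

class SpanHead(nn.Module):
    """Score start/end positions as S·T_i and E·T_i for every token (sketch)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.start_end = nn.Linear(hidden, 2)   # weight row 0 acts as S, row 1 as E

    def forward(self, sequence_output, start_positions=None, end_positions=None):
        # sequence_output: (batch, seq_len, hidden) -> the T_i
        start_logits, end_logits = self.start_end(sequence_output).unbind(dim=-1)
        if start_positions is None:
            return start_logits, end_logits     # (batch, seq_len) scores each
        loss_fn = nn.CrossEntropyLoss()
        # negative log-likelihood of the correct start and end positions
        return loss_fn(start_logits, start_positions) + loss_fn(end_logits, end_positions)
```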

2. Feed the [CLS] representation C

Into an output layer for classification, such as sentiment analysis.

This output layer can basically be a softmax over a transformation matrix W that maps the output C to the label dimension K, i.e.:

P = softmax(C⋅Wᵀ)

And then we can use as loss the log of that softmax for the correct label (i.e. a standard cross-entropy loss).
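And a corresponding sketch for the classification case (again an illustration, not the reference implementation):

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """softmax(C·Wᵀ) over K labels, computed from the final [CLS] representation C (sketch)."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.W = nn.Linear(hidden, num_labels)

    def forward(self, cls_output, labels=None):
        logits = self.W(cls_output)             # (batch, K)
        if labels is None:
            return logits.softmax(dim=-1)
        # cross-entropy = negative log of the softmax probability of the correct label
        return nn.CrossEntropyLoss()(logits, labels)
```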

For more examples of how to define the output of BERT for a specific task, you can check the paper!

What’s next!

Personally, I think BERT is a great example showing that letting the model learn where to attend and how to represent the whole input outperforms strategies like forcing the model to learn from right to left, or, in Q&A setups, encoding the question and the paragraph separately and then concatenating them. We will therefore probably achieve better results if we let the model define how to represent the input: feed it all the information, and just define the loss and the prediction we want.
