BERT

Shaurya Goel
7 min readJul 22, 2019


We will discuss BERT in this article. I assume that you know Transformers.

A Transformer is a self-attention model that processes sequential input the way an RNN does, but in parallel rather than step by step.

If we randomly initialise a DL model to solve an NLP task, we need large amounts of data and time, because the model has to learn the language itself (grammar, semantics, etc.) as well as our specific task, and even then the results may be suboptimal. If we first learn the semantics of the language and then fine-tune the model for the specific task (by adding an output layer), we can achieve better results, much like starting from a vision model pre-trained on ImageNet. This is what BERT does.

BERT (Bidirectional Encoder Representations from Transformers) is essentially a pre-trained language model which can be fine-tuned for specific NLP tasks like question answering, sentiment analysis, etc. The pre-trained models can be downloaded from the authors' GitHub[2] and come in variants with different numbers of transformer layers, hidden dimensions and attention heads; separate pre-trained models are released for different languages. The authors achieved SOTA on 11 NLP tasks, including GLUE, MultiNLI, SQuAD v1.1 and SQuAD v2.0, using the same pre-trained model fine-tuned for each specific task.

Most people will not have to pre-train BERT: pre-training takes a lot of time and is resource intensive. Most people only need to fine-tune the pre-trained model, which takes just a few hours on a GPU.

What does each letter in BERT mean?

B- Bidirectional- the proposed model uses context from both directions

ER- Encoder Representation- the model uses the encoder part of the Transformer to learn a representation for each token

T- Transformer- the model uses Transformers instead of RNNs to process text

Pre-Training

We pre-train BERT in an unsupervised way. The training data can be anything that contains meaningful sentences in the desired language; Wikipedia and BooksCorpus are great examples of datasets with many English sentences.

There are two unsupervised methods-

Masked language model (MLM)

Given a text, we randomly mask some percentage of the input words (15% in the paper) and predict the masked token(s) using only their context. The masked token(s) are predicted from the hidden representations of the final layer. This task is also called the Cloze task. E.g.-

My dog is hairy -> My dog is [MASK]

So, why do we need masking? Without masking, a multi-layer bidirectional transformer could trivially see the actual word and learn to predict it using no context at all.

But now another problem arises: the [MASK] token never appears during fine-tuning, creating a mismatch between pre-training and fine-tuning. To mitigate this, we don't always replace the selected words with the [MASK] token: 80% of the time we use the [MASK] token, 10% of the time a random token and 10% of the time the original token. E.g.-

My dog is hairy -> My dog is [MASK] (80% of the time)

My dog is hairy -> My dog is apple (10% of the time)

My dog is hairy -> My dog is hairy (10% of the time)

Another advantage of masking is that the transformer doesn't know which words it will be asked to predict (at the final layer), so it is forced to keep a contextual representation for every input token.
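To make the 80/10/10 rule concrete, here is a minimal sketch in plain Python (the token list, toy vocabulary and function name are made up for illustration; real implementations work on WordPiece ids rather than strings):

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """BERT-style masking: select ~15% of tokens, then replace 80% of the
    selected ones with [MASK], 10% with a random token, 10% left unchanged."""
    masked = list(tokens)
    labels = [None] * len(tokens)           # None = not selected, nothing to predict
    for i, token in enumerate(tokens):
        if random.random() < select_prob:
            labels[i] = token               # the model must recover the original token
            r = random.random()
            if r < 0.8:                     # 80%: [MASK] token
                masked[i] = "[MASK]"
            elif r < 0.9:                   # 10%: random token from the vocabulary
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the original token
    return masked, labels

vocab = ["my", "dog", "is", "hairy", "apple", "store", "milk"]
print(mask_tokens(["my", "dog", "is", "hairy"], vocab))
```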

Next sentence prediction (NSP)

Many tasks such as Question Answering (QA) and Natural Language Inference (NLI) require understanding the relationship between two sentences, which is not captured by a language model. So, we use the NSP task.

In this task, we are given two sentences (A and B) and asked to predict whether B follows A. Training data can be generated trivially: 50% of the time B actually follows A in the corpus, and 50% of the time B is a random sentence from the corpus.
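A minimal sketch of how such NSP pairs could be generated, assuming a toy corpus stored as a list of documents, each a list of sentences (the structure and function name are hypothetical):

```python
import random

def make_nsp_pair(documents):
    """Build one (sentence A, sentence B, label) NSP example from a corpus
    given as a list of documents, each a list of sentences."""
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)
    sent_a = doc[idx]
    if random.random() < 0.5:               # 50%: B really follows A -> IsNext (1)
        sent_b, label = doc[idx + 1], 1
    else:                                   # 50%: B is a random sentence -> NotNext (0)
        sent_b, label = random.choice(random.choice(documents)), 0
    return sent_a, sent_b, label

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they cannot fly"]]
print(make_nsp_pair(docs))
```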

Both of the above tasks (MLM and NSP) are combined, and the model is trained with a combined loss function. E.g.-

([SEP] token is used to separate two sentences. [SEP] and [CLS] tokens are discussed in the next section)

[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP] => should output high probability for NSP task.

[CLS] the man [MASK] to the store [SEP] penguin [MASK] are flightless birds [SEP] => should output low probability for NSP task.

Input and Output representations

To cover various NLP tasks, BERT has to represent both a single sentence and a pair of sentences (e.g. <Question, Answer>) in a single input sequence.

A “sentence” refers to an arbitrary span of contiguous text, not necessarily an actual linguistic sentence.

A “sequence” refers to one or two sentences packed together.

For each token, we use WordPiece embeddings[5]. Vocabulary size=30,000
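As an aside, WordPiece tokenization can be inspected with the Hugging Face transformers library (not something the paper depends on, just a convenient way to see sub-word splitting; the exact splits depend on the downloaded vocabulary):

```python
from transformers import BertTokenizer

# Downloads the 30,000-entry WordPiece vocabulary of the uncased base model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Words outside the vocabulary are split into sub-word pieces prefixed with "##".
print(tokenizer.tokenize("my dog is hairy"))
print(tokenizer.tokenize("wordpiece embeddings"))   # e.g. ['word', '##piece', 'em', '##bed', '##ding', '##s']
```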

[CLS]- first token of every sequence. Final hidden state representation (C) of this token is used as the sequence representation for classification tasks.

[SEP]- used to separate two sentences.

Two examples are given at the end of the previous section.

BERT Pre-Training

E- input embeddings

Tᵢ/Tᵢ’- Final hidden representation of token i of sentence A/B

Tᵢ’s are used for MLM task

C is used for the NSP task (C is not a meaningful sentence representation without fine-tuning, as it was trained only on the NSP task)

Both C and Tᵢ ∈ ℝʰ where h is the size of the hidden dimension.

Input embedding = Token embedding + Segment embedding + Position embedding

Input Representation

The segment embedding is used to distinguish between the two sentences. The embedding corresponding to each sentence (E_A or E_B) is added to every token embedding of that sentence.

The position embedding captures the absolute position of each input token. Unlike the fixed sinusoidal encoding of the original Transformer paper, BERT's position embeddings are learned.

The embeddings E_A, E_B, [CLS] and [SEP] are learned.
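A minimal PyTorch sketch of this sum, with dimensions matching BERT_BASE (layer normalisation and dropout, which BERT also applies to the summed embeddings, are omitted):

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Input embedding = token + segment + position (simplified sketch)."""
    def __init__(self, vocab_size=30000, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)   # sentence A (0) vs. sentence B (1)
        self.position = nn.Embedding(max_len, hidden)     # learned absolute positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))               # broadcasts over the batch

emb = BertInputEmbedding()
token_ids = torch.randint(0, 30000, (1, 8))               # one sequence of 8 token ids
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])    # first half = A, second half = B
print(emb(token_ids, segment_ids).shape)                  # torch.Size([1, 8, 768])
```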

Pre-Training Details

  1. Both tasks (MLM and NSP) are trained together with a combined loss function (a small sketch of this loss appears after this list).

Total loss = mean masked LM likelihood + mean NSP likelihood

2. This paper uses GELU activation[4] instead of ReLU.

3. Authors trained two BERT models- BERT_BASE and BERT_LARGE.

BERT_BASE- L=12, H=768, A=12, Total parameters=110M

BERT_LARGE- L=24, H=1024, A=16, Total Parameters=340M

L- number of layers (transformer blocks)

H- hidden size

A- number of self-attention heads
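As an illustration of point 1 above, here is a minimal PyTorch sketch of the combined loss; the tensor shapes and the -100 ignore-index convention are assumptions made for this sketch, not details from the paper:

```python
import torch
import torch.nn.functional as F

vocab_size = 30000
mlm_logits = torch.randn(4, 128, vocab_size)   # (batch, seq_len, vocab) predictions at every position
nsp_logits = torch.randn(4, 2)                 # (batch, 2) IsNext / NotNext scores from C

mlm_labels = torch.randint(0, vocab_size, (4, 128))
mlm_labels[:, 20:] = -100                      # -100 marks positions that were not masked
nsp_labels = torch.randint(0, 2, (4,))

# Cross-entropy (= negative log-likelihood), averaged over masked positions / over the batch.
mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1), ignore_index=-100)
nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
total_loss = mlm_loss + nsp_loss               # combined pre-training loss
print(total_loss)
```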

BERT_LARGE performs better than BERT_BASE and beats all previous SOTA results on 11 NLP tasks.

Fine-Tuning

For each task, we give task-specific inputs and outputs to BERT and fine-tune all parameters end-to-end.

At the input, sentence A and B from pre-training are analogous to-

  1. sentence pairs in paraphrasing
  2. hypothesis-premise pair in entailment
  3. question-passage pair in QA
  4. degenerate text-ϕ pair (single sentence) in text classification or sequence tagging

At the output-

  1. token representations are fed to an output layer for token level tasks (e.g. sequence tagging or QA)
  2. [CLS] representation is fed to an output layer for classification tasks (e.g. entailment or sentiment analysis)

This output layer is the only layer added to the model for fine-tuning and is learned from scratch.

Below are the ways to fine-tune pre-trained BERT on different tasks-

GLUE

General Language Understanding Evaluation (GLUE) is a collection of 9 diverse NLP tasks, all of which are classification tasks over a single sentence or a sentence pair. So, we can use C (the hidden representation of the [CLS] token) as the aggregate representation and feed it to a classifier. We introduce an output feed-forward layer with weights of dimension K x H, where K is the number of labels, and train with the standard log-softmax classification loss.
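A minimal PyTorch sketch of such a classification head on top of C (the module name is made up; K here is the number of labels, e.g. K = 2 for a binary task):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlueClassificationHead(nn.Module):
    """A single K x H linear layer applied to C, the [CLS] representation."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, cls_repr, labels=None):
        logits = self.classifier(cls_repr)        # (batch, K)
        if labels is None:
            return logits
        return F.cross_entropy(logits, labels)    # = -log softmax of the correct label

head = GlueClassificationHead()
cls_repr = torch.randn(8, 768)                    # C for a batch of 8 sequences, taken from BERT
labels = torch.randint(0, 2, (8,))
print(head(cls_repr, labels))
```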

SQuAD v1.1

Stanford Question Answering Dataset (SQuAD) is a collection of thousands of question/answer pairs. Given a question and a passage (from Wikipedia), we have to predict the answer span in the passage. Span here means that the answer is a contiguous piece of text from the passage, so we only have to predict its start and end positions. Performance on this task is measured with the F1 score.

We introduce start and end vectors, S and E ∈ ℝʰ, during fine-tuning. The probability of word i being the start of the answer span is the dot product between S and Tᵢ followed by a softmax over all words in the paragraph. Similarly, we can find the probability of word j being the end of the answer span.

The score of a candidate span from position i to position j is defined as S⋅Tᵢ+E⋅Tⱼ and the maximum scoring span where j≥i is the prediction. The training objective is the sum of log-likelihoods of the correct start and end positions.
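A sketch of this span scoring in PyTorch, using random tensors in place of the real S, E and token representations Tᵢ:

```python
import torch

hidden, seq_len = 768, 50
T = torch.randn(seq_len, hidden)     # final hidden states of the passage tokens
S = torch.randn(hidden)              # learned start vector
E = torch.randn(hidden)              # learned end vector

start_scores = T @ S                 # S . T_i for every token i
end_scores = T @ E                   # E . T_j for every token j
start_probs = torch.softmax(start_scores, dim=0)   # P(token i starts the answer)
end_probs = torch.softmax(end_scores, dim=0)       # P(token j ends the answer)

# Best span: maximise S.T_i + E.T_j subject to j >= i.
span_scores = start_scores[:, None] + end_scores[None, :]
idx = torch.arange(seq_len)
span_scores = span_scores.masked_fill(idx[None, :] < idx[:, None], float("-inf"))
best = span_scores.argmax()
i, j = divmod(best.item(), seq_len)
print(i, j, span_scores[i, j].item())
```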

SQuAD v2.0

Same as SQuAD v1.1, but now it is possible that no answer exists.

We treat questions that don't have an answer as having an answer span that starts and ends at the [CLS] token. The score of the no-answer span = S⋅C + E⋅C.

For prediction, we compare the score of the no-answer span with the score of the best non-null span. If the best non-null span score exceeds the no-answer score by more than a threshold, we predict the non-null answer span. This threshold is selected on the validation set to maximise the F1 score.
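Continuing the sketch from SQuAD v1.1, the no-answer comparison might look like this (the placeholder value and the threshold τ are made up; in practice τ is tuned on the validation set):

```python
import torch

hidden = 768
C = torch.randn(hidden)                 # [CLS] representation
S, E = torch.randn(hidden), torch.randn(hidden)

null_score = S @ C + E @ C              # no-answer span: start = end = [CLS]
best_non_null = torch.tensor(3.2)       # best S.T_i + E.T_j with j >= i (as in the v1.1 sketch)
tau = 1.0                               # threshold tuned on the validation set to maximise F1

prediction = "answer span" if best_non_null > null_score + tau else "no answer"
print(prediction)
```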

SWAG

Situations With Adversarial Generations (SWAG) has thousands of sentence pair examples. Given a sentence, the task is to choose the most plausible continuation among four choices.

During fine-tuning, we construct four input sequences, each containing the given sentence (sentence A) concatenated with one of the four choices (sentence B). A new vector is introduced whose dot product with C gives a score for each choice; a softmax over the four scores gives the probability of each choice.
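A sketch of this scoring; V is the task-specific vector and C holds the [CLS] representations of the four constructed sequences (random tensors stand in for real model outputs):

```python
import torch
import torch.nn.functional as F

hidden = 768
V = torch.randn(hidden, requires_grad=True)   # task-specific vector, learned during fine-tuning
C = torch.randn(4, hidden)                    # [CLS] representations of the 4 (sentence, choice) sequences

scores = C @ V                                # one score per choice
probs = F.softmax(scores, dim=0)              # normalised over the four choices
print(probs, probs.argmax().item())
```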

Feature-Based Approach

So far we have only considered fine-tuning the pre-trained BERT model. However, not every task is easily represented with the Transformer encoder architecture, and fine-tuning the full model can be expensive. In such cases, we can extract the features learned by BERT during pre-training. These features are high-level representations of the input and can still be useful (without fine-tuning) for various tasks. In the paper's NER experiments, the best feature-based performance is achieved by concatenating the token representations from the last four hidden layers of the pre-trained Transformer.
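As an illustration (not part of the paper), the Hugging Face transformers library can return all hidden states, which makes it easy to concatenate the last four layers; the exact attribute names depend on the library version:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("my dog is hairy", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: embedding output + one tensor per layer, each of shape (batch, seq_len, 768)
hidden_states = outputs.hidden_states
features = torch.cat(hidden_states[-4:], dim=-1)    # concatenate the last four layers
print(features.shape)                                # (1, seq_len, 4 * 768)
```

Each token then has a 3072-dimensional feature vector that can be fed to a small task-specific model (the paper feeds such features to a BiLSTM before the classification layer in its NER experiments).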
