NLP — BERT & Transformer

Google published an article, “Understanding searches better than ever before”, and positioned BERT as one of its most important updates to its search algorithms in recent years. BERT is a language representation model that achieves impressive accuracy on many NLP tasks. If you understand better what people ask, you can give better answers. Google says 15% of Google queries have never been seen before. The real issue is not what people ask; it is how many ways a question may be asked. Previously, Google search was keyword-based, which is far from understanding what people ask or resolving the ambiguity in human language. That is why Google switched from keyword matching to BERT in its search engine. In the example below, BERT understands the intention of “Can you get medicine for someone pharmacy” better and returns more relevant results.

In the last article, we covered word embedding; let’s move on to BERT now. Strictly speaking, BERT is a training strategy, not a new architecture design. To build a whole system with this concept, we first need to study another proposal from Google Brain: the Transformer. The concept is complex and will take some time to explain. BERT needs only the encoder part of the Transformer. For completeness, we will cover the decoder as well, but feel free to skip it according to your interests.
Encoder-Decoder & Sequence to Sequence
To map an input sequence to an output sequence, we can apply sequence-to-sequence learning with an encoder-decoder model, like seq2seq. Such a model has many applications. For example, it can translate a sentence from one language to another. Sequence models also apply to text summarization, image captioning, conversational modeling and many other NLP tasks.

We will move to areas that are more important for us now. But if you need further information, Google the phrase “sequence to sequence” or “Seq2Seq” later.
For many years, we have used RNNs, LSTMs or GRUs in these models to parse the input sequence, accumulate information and generate the output sequence. But this approach suffers from a few drawbacks.
- Learning long-range context in NLP with an RNN gets more difficult as the distance increases.

- RNNs are directional. In the example below, a backward RNN may have a better chance of guessing the word “win” correctly.

To avoid making the wrong choice, we can design a model that includes both a forward and a backward RNN (i.e. a bidirectional RNN) and then add both results together.

We can also stack pyramidal bidirectional RNN layers to explore context better.

But at some point, we may argue that to understand the context of the word “view” below, we should check all the words in the paragraph concurrently, i.e. to know what “view” may refer to, we should apply fully-connected (FC) layers directly to all the words in the paragraph.

However, this problem involves high-dimensional vectors and becomes like finding a needle in a haystack. So how do humans solve this problem? The answer may lie in “attention”.
Attention
The picture below contains about 1M pixels, but most of our attention will focus on the girl in the blue dress.

When creating a context for our predictions, we should not put equal weight on all the information we receive. We need to focus! We should create a context from what interests us. But this focus can shift over time. For example, if we are looking for the ferry, our attention may instead focus on the ticket booth and the waiting lines behind the second lamp post. So how can we conceptualize this into equations and deep networks?
In an RNN, we make predictions based on the input xₜ and the previous output hₜ₋₁ for timestep t.

For an attention-based system, we look at the whole input x at each step, but x will be modified with attention.

We can visualize that the attention process masks out information that is currently not important.

For example, for each input feature xᵢ, we train an FC layer with a tanh output to score how important feature i (or pixel i) is given the previous output h. For instance, if our last output h is the word “ferry”, the score will be computed as:
Then, we normalize the score using a softmax function.

The attention Z will be the weighted output of the input features. In our example, the attention may fall around the ticket sign. Note that there are many realizations of the attention concept; the equation here is just one of them. The key point is that we introduce an extra step to mask out information that we care less about at the current step. In the next few sections, we will develop the concept further before we introduce the details and the math.
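As a concrete sketch of this idea, the snippet below scores each input feature against the previous output with a small tanh network, normalizes the scores with a softmax, and takes the weighted sum. The parameters W, b and v here are hypothetical, randomly initialized stand-ins for trained weights, and this is just one of the many possible realizations mentioned above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def simple_attention(features, h, W, b, v):
    # Score each feature x_i against the previous output h with a
    # one-layer network: score_i = v . tanh(W [x_i; h] + b)
    scores = np.array([v @ np.tanh(W @ np.concatenate([x, h]) + b)
                       for x in features])
    weights = softmax(scores)                       # normalize to sum to 1
    z = (weights[:, None] * features).sum(axis=0)   # weighted sum = attention
    return z, weights

rng = np.random.default_rng(0)
d = 4
features = rng.standard_normal((5, d))   # five input features x_i
h = rng.standard_normal(d)               # previous output
W = rng.standard_normal((d, 2 * d))      # hypothetical trained parameters
b = rng.standard_normal(d)
v = rng.standard_normal(d)
z, w = simple_attention(features, h, W, b, v)
```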
Query, Keys, Values (Q, K, V)
We can expand the concept of attention with queries, keys, and values. Let’s go through our example again. This time the query is “running”. In this example, a key classifies the object in a bounding rectangle and the value is the raw pixels in the bounding box. In the example below, one of the bounding boxes contains the key “girl”. From one perspective, a key is simply an encoded version of its “value”. But in some cases, a value itself can be used as a key.

To create attention, we determine the relevance between the query and the keys. Then we mask out the associated values that are not relevant to the query. For example, for the query “ferry”, the attention covers the waiting lines and the ticket sign below.

Now, let’s see how we apply attention to NLP and start our Transformer discussion. The Transformer is pretty complex: it is a new encoder-decoder design built on attention. We will take some time to discuss it.
Transformer
Many DL problems involve the major step of representing the input with a dense representation. This process forces the model to learn what is important in solving a problem. The extracted features are called latent features, hidden variables or a vector representation. Word embedding creates a vector representation of a word that we can manipulate with linear algebra. One major problem is that words can have different meanings in different contexts. In the example below, word embedding uses the same vector to represent “bank”, but the word has different meanings in the sentence.

To create a dense representation of this sentence, we can apply an RNN to parse the sequence of words in the form of embedding vectors. We gradually accumulate information at each timestep and produce a vector representation at the end of the pipeline. But one may argue that as the sentence gets longer, early information may be forgotten or overridden. This may get worse if our input is a long paragraph.

Maybe, we should convert a sentence to a sequence of vectors instead, one vector per word. In addition, the context of a word will be considered during the encoding process through attention. For example, the word “bank” will be treated and encoded differently according to the context.

Let’s integrate this concept with attention using query, key, and value. We decompose sentences into single words. Each word acts as a value and we use the word itself as the key to its value.

Each word forms a single query, so the sentence above has 21 queries. How do we generate the attention for a query, say Q₁₆ for the word “bank”? We compute the relevancy of the query word “bank” with each key in the sentence. The attention is simply a weighted output of the values according to this relevancy. Conceptually, we “grey out” the non-relevant values to form the attention.

By going through Q₁ to Q₂₁, we collect all 21 attentions. These 21 vectors represent the sentence above.
Transformer Encoder
Let’s get into more detail. In our demonstration, we use the sentence below instead, which contains 13 words.
New England Patriots win 14th straight regular-season game at home in Gillette stadium.
In the encoding step, the Transformer uses learned word embeddings to convert these 13 words, in one-hot-vector form, into 13 512-D word embedding vectors. They are then passed into an attention-based encoder to pick up the context information for each word.
For each word-embedding vector, there will be one output vector. These 13 word-embedding vectors are fed into position-wise fully connected layers (details later) to generate a sequence of 13 encoded vectors representing the sentence. Each output vector hᵢ is encoded as a 512-D vector. Conceptually, the output hᵢ encodes the word xᵢ with its context taken into consideration.

Let’s zoom into this attention-based encoder. The encoder actually stacks up 6 encoders, shown on the left below. The output of each encoder is fed to the encoder above it. Each encoder takes 13 512-D vectors and outputs 13 512-D vectors. For the first encoder (encoder₁), the input is the 13 512-D word embedding vectors.

Scaled Dot-Product Attention
The first part of each encoder performs the attention. Each word in the sentence serves as a single query. In our example, we have 13 words and therefore 13 queries. But we don’t compute the attention one at a time for each query.

Instead, all 13 attentions can be computed concurrently by packing all the queries, keys and values into the matrices Q, K, and V. The result is packed into a 13 × 512 matrix as well. The matrix product QKᵀ measures the similarity between the queries and the keys. As the dimension of the vectors increases, the dot products in QKᵀ grow large and push the softmax function into regions with vanishing gradients. To correct that, the Transformer divides the dot products by a scale factor equal to the square root of the dimension (√dₖ).
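A minimal NumPy sketch of scaled dot-product attention for our 13-word example might look like this. The random matrices are placeholders for the real projections of the word embeddings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity, scaled by sqrt(d_k)
    weights = softmax(scores)         # one weight distribution per query
    return weights @ V                # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((13, 512))   # 13 queries, one per word
K = rng.standard_normal((13, 512))   # 13 keys
V = rng.standard_normal((13, 512))   # 13 values
out = scaled_dot_product_attention(Q, K, V)   # 13 × 512 result
```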
Multi-Head Attention
In the last section, we generated one attention per query. But we can pay attention to multiple areas.

Multi-Head Attention generates h attentions per query. Conceptually, we pack h scaled dot-product attentions together.

The diagram below shows two attentions, one in green and the other in yellow.

For the encoders and decoders in the Transformer, we use 8 attention heads each. So why do we need 8 heads and not 1? In each head, we transform Q, K, and V linearly with a different trainable matrix.

Each transformation gives us a different projection of Q, K, and V. So 8 heads allow us to view relevancy from 8 different “perspectives”. This eventually pushes the overall accuracy higher, at least empirically.
The transformations also reduce the output dimensions, so even though 8 heads are used, the computational complexity remains about the same.

In multi-head attention, we concatenate the output vectors followed by a linear transformation.
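Putting the pieces together, a toy multi-head attention can be sketched as follows: each head projects the 512-D input down to 64-D with its own matrices, computes scaled dot-product attention, and the concatenated heads are mixed by a final linear transformation Wo. All weights here are random placeholders for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention for one head
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # Each head projects X from 512-D to 64-D, computes attention,
    # then the heads are concatenated and mixed by Wo
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

h, d_model, d_k = 8, 512, 64            # 8 heads, 512 / 8 = 64 per head
rng = np.random.default_rng(0)
X = rng.standard_normal((13, d_model))  # 13 word vectors
Wq = rng.standard_normal((h, d_model, d_k)) * 0.05
Wk = rng.standard_normal((h, d_model, d_k)) * 0.05
Wv = rng.standard_normal((h, d_model, d_k)) * 0.05
Wo = rng.standard_normal((h * d_k, d_model)) * 0.05
out = multi_head_attention(X, Wq, Wk, Wv, Wo)   # 13 × 512
```

Since each head works in 64 dimensions rather than 512, the total cost of the 8 heads is roughly that of a single full-width attention.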

Here is the encoder with multi-head attention.

Skip connection & Layer normalization
The Transformer applies a skip connection (as in the residual blocks of ResNet) to the output of the multi-head attention, followed by layer normalization. Both techniques make training easier and more stable. In batch normalization, we normalize each output dimension based on statistics collected from the training batches. In layer normalization, we use the values within the same layer output to perform the normalization, which is more suitable for time-sequence data. We will not elaborate on them further, as they are not critical to understanding the concept; they are common techniques to make training more stable and easier.
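Assuming the standard “add & norm” formulation (normalize each position over its 512 features, with a learnable scale gamma and shift beta), the step can be sketched as:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each position over its own 512 features (not over a batch)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    # Skip connection around the sublayer, then layer normalization
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
x = rng.standard_normal((13, 512))           # sublayer input
attn_out = rng.standard_normal((13, 512))    # stand-in for attention output
gamma, beta = np.ones(512), np.zeros(512)    # learnable scale and shift
y = add_and_norm(x, attn_out, gamma, beta)
```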
Position-wise Feed-Forward Networks
Then we apply fully-connected (FC) layers with ReLU activation. This operation is applied to each position separately and identically. It is called a position-wise feed-forward network because the ith output depends only on the ith attention output of the attention layer.
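Assuming the formulation in the Transformer paper, FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ applied to each position independently, a sketch looks like this (random weights stand in for trained parameters):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Applied to each of the 13 positions independently and identically:
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048   # inner dimension 2048 as in the Transformer paper
x = rng.standard_normal((13, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.01
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.01
b2 = np.zeros(d_model)
y = position_wise_ffn(x, W1, b1, W2, b2)   # 13 × 512
```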

Similar to the attention, Transformer also uses skip connection and layer normalization.
Positional Encoding
Politicians are above the law.
This sounds awfully wrong. But it demonstrates that the ordering and position of words matter. CNNs and RNNs discover hierarchical local information nicely. But for the attention layer in the Transformer, we don’t impose any explicit rules on the ordering of features. This makes training harder.
Positional Encoding encodes absolute or relative positional information into the word embedding. It adds a position embedding of the same dimension to the word embedding. The final embedding of a word is the sum of the word embedding and the position embedding that models the word’s position (a.k.a. f(position)). This provides additional information for the model to learn from.

This position embedding can be fixed or learned. For a fixed position embedding, we can use the equations PE(pos, 2i) = sin(pos/10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)). These periodic functions allow us to embed relative word-position information into the word embedding.
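The fixed sinusoidal encoding can be computed directly. The sketch below follows the standard formulation, with sines on the even dimensions and cosines on the odd dimensions:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]          # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = positional_encoding(13, 512)
# The position embedding is simply added to the word embedding:
#   x = word_embedding + pe
```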

The position embedding can also be learned like other DL parameters. But here we use two sets of position parameters aᵢⱼ, one for the values and one for the keys. aᵢⱼ models the position embedding between positions i and j. We add the values in a to the attention function shown on the right below. This allows the attention output to depend on the words’ positions as well as the query relevancy. Therefore, we can learn aᵢⱼ just like W.

Alternatively, we can just model the relative distance between two words. So instead of modeling a(i, j), we learn w(j−i), where i and j are the corresponding word positions. For words beyond a distance of k, we clip w and use the value of w(k) or w(−k) instead. Therefore, we only need to learn 2k + 1 sets of parameters. We don’t need to get into more detail for now. If you want more information, please refer to the original research paper.

This concludes the encoder. Next, we will discuss the decoder. This part is optional because BERT uses the encoder only. It is nice to know the decoder, but it is relatively long and harder to understand, so skip the next six sections if you want.
Transformer Decoder (Optional)
The vector representation h of a sentence, created by the encoder, is fed into the decoder for training or inference. The following is a simplified view of the decoder during training.

Recall that attention is composed of a query, keys, and values. For the decoder, the vector representation h is used as the keys and values for the attention-based decoder. The ground-truth labels and the predictions are used as the query during training and inference respectively.
In training, we shift the ground-truth translation (“Los Patriots de …”) one timestep to the right and feed it to the attention as the query. We will defer the discussion of the attention-based decoder and come back to it later.

Embedding and Softmax in Training (Optional)
The output of the attention decoder is fed into a linear layer followed by a softmax to determine the output words. In the encoding-decoding process, we use word embeddings to transform words into vectors. The linear layer in the decoder is the inverse of the word-embedding process. In practice, the word embeddings and this linear layer can share the same weights (or their inverse). In fact, accuracy may improve.

Inference (Optional)
In inference, we predict one output label at a time. For the next time step, we can use all the previous predictions as the input query to the attention decoder.

Encoder-decoder attention (Optional)
Let’s get into the details of the encoder-decoder attention, the core part of the decoder. Recall that previously we applied linear transformations to the input word embeddings to create Q, K, and V respectively. In the encoder, the key, value and query all originate from the same input word sequence.

For the Transformer decoder, the key and value originate from the encoder output h, but the query comes from the predicted output sequence so far. The decoder has two attention stages (in blue below).

The first stage prepares the vector representation for the query needed by the second stage. We use the attention mechanism to create a vector representation of the outputs predicted so far, which acts as the query input for the attention in the second stage. Then, we combine it with K and V, prepared from the encoder’s vector representation h, to create the attention we need for the current timestep.
Let’s consider what happens when translating “New England Patriots win …” to the ground truth “Los Patriots de Nueva Inglaterra …”. At step 3 of the decoding process, our predicted output so far is “Los Patriots”. We feed “Los Patriots” into the first attention stage to generate a vector representation. After the layer normalization, we apply a linear transformation to it. This is the Q we need for the second attention stage.

So to generate the attention, we use the vector representation h of the sentence to generate K and V, and the previous attention output for Q. Intuitively, we ask: given the output prediction “Los Patriots” so far, what is the attention in h that we should use to make the next prediction “de”?

Once the attention is computed, we pass it through the position-wise feed-forward network. The attention decoder stacks up 6 of these decoders, with the last output passing through a linear layer followed by a softmax to predict the next word “de”.

Here is the diagram for the whole Transformer.

Training (optional)
During training, we do know the ground truth. The attention model is not a time-sequence model, so we can compute the output predictions concurrently. For the input to the decoder, we just shift the ground-truth word sequence to the right by one timestep.

But for the prediction at position i, we must make sure the attention can only see the ground-truth output from positions 1 to i−1. Therefore, we add a mask in the attention to mask out information from position i onwards when creating the attention for position i.
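One common way to implement this mask is to set the scores of future positions to a large negative number before the softmax, so they receive (near) zero weight. The sketch below assumes that approach, with toy dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    T = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    scores[future] = -1e9     # future positions get ~zero softmax weight
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((5, 8))   # 5 decoder positions, toy dimension 8
out = masked_attention(Q, K, V)
# Position 0 can only attend to itself, so its output equals V[0]
```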

Soft Label (Optional)
To avoid overfitting, training also uses dropout and label smoothing. Label smoothing targets a probability prediction for the ground truth lower than 1.0 (say 0.9) and higher than 0 for the non-ground-truth classes (say 0.1 in total). This avoids becoming over-confident on specific data. In short, being overconfident about a data point may be a sign of overfitting and hurts generalization. We will not elaborate on all the details here.
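A minimal sketch of one common label-smoothing formulation (the exact scheme in the paper may distribute the mass slightly differently):

```python
import numpy as np

def smooth_labels(target_index, vocab_size, epsilon=0.1):
    # The ground-truth class gets 1 - epsilon; the remaining epsilon is
    # spread uniformly over the other classes
    dist = np.full(vocab_size, epsilon / (vocab_size - 1))
    dist[target_index] = 1.0 - epsilon
    return dist

# Toy example: 5-word vocabulary, ground-truth class 2
dist = smooth_labels(2, 5, epsilon=0.1)
```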
NLP Tasks
So far we have focused our discussion on sequence-to-sequence learning, like language translation. While this type of problem covers a wide range of NLP tasks, there are other types of NLP tasks. For example, in question answering (QA), we want to spot the answer in a paragraph for the question being asked.

There is another type of NLP task called Natural Language Inference (NLI). Each problem contains a pair of sentences: a premise and a hypothesis. An NLI model predicts whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given a premise.

In these problems, the models take two separate sentences (or paragraphs) as input. Next, we are going to model a vector representation that can be expanded to handle these tasks as well. That leads to the final discussion: BERT.
BERT (Bidirectional Encoder Representations from Transformers)
With word embedding, we create a dense representation of words. But in the Transformer section, we discovered that plain word embedding cannot explore the context of neighboring words well. Also, how can we create a dense representation for other NLP inputs, including those in QA and NLI? In addition, we want a representation model that is multi-purpose. NLP training is intense! Can we pre-train a model and repurpose it for other tasks without building a new model again?
Let’s have a quick summary of BERT. In BERT, a model is first pre-trained with data that requires no human labeling. Once done, the pre-trained model outputs a dense representation of the input. To solve other NLP tasks, like QA, we modify the model by simply adding a shallow DL layer connected to the output of the original model. Then, we retrain the model with data and labels specific to the task.
In short, there is a pre-training phase in which we create a dense representation of the input (the left diagram below). The second phase fine-tunes the model with task-specific data, like MNLI or SQuAD, to solve the target NLP problem.

Model
BERT uses the Transformer encoder we discussed to create the vector representation. In contrast to other approaches, it discovers the context concurrently (bidirectionally) rather than directionally.

Input/Output Representations
But first, let’s define how the input is assembled and what output is expected from the pre-trained model. First, the model needs to take one or two word-sequences to handle the wide spectrum of NLP tasks.

All input starts with a special token [CLS] (a special classification token). If the input is composed of two sequences, a [SEP] token is put between Sequence A and Sequence B.
If the input has T tokens, including the added tokens, the output will also have T outputs. Different parts of the output are used to make predictions for different NLP tasks. The first output is C (sometimes written as the output of the [CLS] token). It is the only output used to derive a prediction for any NLP classification task. For non-classification tasks with only one sequence, we use the remaining outputs (without C). For QA, the outputs corresponding to the paragraph sequence are used to derive the start and end span of the answer.

So, how do we compose the input embedding? In BERT, the input embedding is composed of word-piece embeddings, segment embeddings, and position embeddings of the same dimension. We add them together to form the final input embedding.

Instead of using every single word as a token, BERT breaks a word into word pieces to reduce the vocabulary size (a 30,000-token vocabulary). For example, the word “helping” may decompose into “help” and “ing”. Then it applies an embedding matrix (V × H) to convert the one-hot vector in Rⱽ to Rᴴ.
The segment embeddings model which sequence a token belongs to: the first sentence or the second. So it has a vocabulary size of two (segment A or B). Intuitively, it adds a constant offset to the embedding whose value depends on whether the token belongs to sequence A or B. Mathematically, we apply an embedding matrix (2 × H) to convert R² to Rᴴ. The last one is the position embedding in H dimensions. It serves the same purpose as in the Transformer: identifying the absolute or relative position of words.
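Summing the three embeddings can be sketched as below. The token ids and table values are made-up placeholders, not real BERT vocabulary entries:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, max_len = 30000, 768, 512   # vocabulary size, hidden size, max positions

# Three embedding tables, all producing H-dimensional vectors
token_emb = rng.standard_normal((V, H)) * 0.02       # V × H word-piece table
segment_emb = rng.standard_normal((2, H)) * 0.02     # 2 × H (sequence A or B)
position_emb = rng.standard_normal((max_len, H)) * 0.02  # learned positions

# Hypothetical token ids for "[CLS] <sequence A> [SEP] <sequence B> [SEP]"
token_ids = np.array([101, 7592, 102, 2088, 102])
segment_ids = np.array([0, 0, 0, 1, 1])   # 0 = sequence A, 1 = sequence B
positions = np.arange(len(token_ids))

# The final input embedding is the element-wise sum of the three
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
```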
Pretraining
BERT pre-trains the model using 2 NLP tasks. The first one is the Masked LM (Masked Language Model). As shown below, we use the Transformer encoder to generate a vector representation of the input. Then BERT applies a shallow decoder to reconstruct the word sequence(s).

Here is an example of the Masked LM and BERT is trained to predict the missing words correctly.

Masked LM
In the Masked LM, BERT masks out 15% of the WordPiece tokens. 80% of the masked tokens are replaced with a [MASK] token, 10% with a random token, and 10% keep the original word. The loss is defined by how well BERT predicts the missing words, not by the reconstruction error of the whole sequence.
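A sketch of this 80/10/10 masking procedure (the helper name and sampling details are illustrative, not BERT’s actual implementation):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    # Pick ~15% of positions; of those, 80% become [MASK], 10% become a
    # random token, and 10% keep the original word. Returns the masked
    # sequence and the positions/labels the loss is computed on.
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok            # the model must predict this token
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token unchanged

    return masked, labels

tokens = "the man went to the store to buy milk".split()
masked, labels = mask_tokens(tokens, vocab=tokens)
```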

We do not replace 100% of the selected WordPiece tokens with the [MASK] token. Otherwise, the model would only learn to predict words at [MASK] positions, rather than serve the final objective of creating vector representations of the sequences with context taken into consideration. That is why BERT replaces 10% with random tokens and 10% with the original words. This encourages the model to learn what may be correct or wrong for the missing words.
Next Sentence Prediction (NSP)
The second pre-training task is NSP. Its key purpose is to create a representation in the output C that encodes the relations between Sequences A and B. To prepare the training input, 50% of the time BERT uses two consecutive sentences as sequences A and B respectively. BERT expects the model to predict “IsNext”, i.e. that sequence B follows sequence A. For the remaining 50% of the time, BERT selects two word-sequences randomly and expects the prediction to be “NotNext”.

In this training, we take the output C and then classify it with a shallow classifier.

As noted, for both pre-training tasks, we create the training data from a corpus without any human labeling.
These two training tasks help BERT to train the vector representation of one or two word-sequences. Besides context, it likely discovers other linguistic information, including semantics and coreference.
Fine-tuning BERT
Once the model is pre-trained, we can add a shallow classifier for any NLP task or a decoder, similar to what we discussed in the pre-training step.

Then, we feed the task-specific data and the corresponding labels to the model and refine all the model parameters end-to-end. That is how the model is trained and refined. So BERT is more of a training strategy than a model architecture. Its encoder is simply the Transformer encoder.
Model
But the model configuration in BERT differs from the Transformer paper. Here is a sample configuration used for the Transformer encoder in BERT.

For example, the base model stacks up 12 encoders instead of 6. Each output vector has 768 dimensions and the attention uses 12 heads.
Source Code
For those interested in the source code for BERT, here is the source code from Google. For Transformer, here is the source code.
Next
NLP training is resource intense. Some BERT models are trained with 64 GB TPUs using multiple nodes. Stay tuned to see how we may train such a model in the next article.
Credit and References
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
