Transformers

Mladen Korunoski
26 min read · Oct 1, 2021

Are you overwhelmed by the number of transformer models out there and have no clue of the similarities and the differences between them? Well, you’ve come to the right place. This story will give an overview of the state-of-the-art transformer models and how they relate to each other.

This story is organized as follows. In Section 1, I will give a brief introduction to the subject. In Section 2, I will explain several machine learning concepts that will help you better understand the Transformer model and its variants. In Section 3, I will walk you through neural machine translation and how it was done before the Transformer. In Section 4, I will talk about the Attention mechanism and how it was invented. In Section 5, I will explain the Transformer model and its parts in detail. In Section 6, I will list the different tasks that use the Transformer. And finally, in Section 7, I will go over several variants of the Transformer, explain their similarities and differences, and indicate which tasks each is suited for. The story is concluded in Section 8.

1. Introduction

When it comes to natural language processing (NLP), a lot of progress has been made in the last 20 years. From statistical NLP, based on complex hand-written rules, to representation learning and deep neural network-style machine learning, achieving state-of-the-art results in many NLP tasks, the technology is accessible now more than ever. Recent advancements in NLP involve the Transformer — a novel neural network architecture capable of solving many NLP tasks. Even though these models are large, they are not so difficult to understand. But, before we dive into the actual models, let’s visit the basics.

2. Machine Learning Background

Here, I explain briefly several machine learning concepts that you will find helpful later on. You can safely skip this part if you are comfortable with machine learning.

Model selection

Besides the parameters that the model learns, every model has hyperparameters that need to be set beforehand. Choosing the optimal hyperparameters is usually done by: 1) dividing the dataset into train/validation/test sets (e.g., in a 60%/20%/20% ratio), 2) fixing the hyperparameters and training the model on the train set, 3) evaluating the performance on the validation set, 4) repeating steps 2) and 3) until the best set of hyperparameters is obtained, and 5) reporting the final performance of the model on the test set.
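As a minimal sketch of this procedure (using scikit-learn, with a synthetic dataset and hypothetical candidate hyperparameter values), it could look like this:

```python
# A minimal sketch of hyperparameter selection with a 60%/20%/20% split.
# The dataset, the model, and the candidate values of C are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 1) 60% train, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_C, best_acc = None, 0.0
for C in [0.01, 0.1, 1.0, 10.0]:                               # 2) fix a hyperparameter value and train
    model = LogisticRegression(C=C).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))          # 3) evaluate on the validation set
    if acc > best_acc:                                         # 4) keep the best setting so far
        best_C, best_acc = C, acc

final = LogisticRegression(C=best_C).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))  # 5) report on the test set
```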

Activation functions

One of the successes of deep neural networks is their inherent ability to transform the input data such that the final task (e.g., classification) becomes easy to solve. Transforming the data in a given layer is achieved by multiplying it with a matrix, W, which the model learns. However, this transformation is linear. We would like to add non-linearity to expand the model’s capabilities. This is where activation functions come in handy. After multiplying the input with the matrix, W, we pass it through the activation function.

One such function is the ReLU (Rectified Linear Unit) activation function.

Besides non-linearity, it helps with better convergence of deep neural networks.

Another one is GELU (Gaussian Error Linear Unit).

It can be seen as combining properties of ReLU, Dropout, and Zoneout in one activation function.
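As a small illustration (in PyTorch, which the story does not prescribe), here is how the two activations behave on the same inputs:

```python
# ReLU clamps negative inputs to zero; GELU scales them smoothly toward zero.
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, steps=7)
print(nn.ReLU()(x))
print(nn.GELU()(x))
```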

Regularization

Deep neural networks have the tendency to overfit the training data thus generalizing poorly. The reason for this is that they are over-parametrized. Instead of decreasing the number of parameters to overcome overfitting, we could apply regularization and thus improve generalization.

Dropout is one technique for regularization. It works by stochastically removing some of the neurons in each layer. This, in principle, is similar to training an ensemble of neural networks and averaging the output of each.

Recurrent Neural Networks use Zoneout. It works similarly to Dropout, but instead of removing hidden units, it stochastically preserves some of them (they keep their values from the previous time step).

Classification

Taking binary classification, for example, a neural network can output the probability of the input data belonging to the positive class. This is done by applying the sigmoid function to the output of the neural network.

However, for the general classification task, we would use the softmax function, which outputs a probability distribution over the classes.
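For example (a toy PyTorch sketch with made-up logits):

```python
# Sigmoid for binary classification vs. softmax for general classification.
import torch

logit = torch.tensor([1.5])              # single output of a binary classifier
print(torch.sigmoid(logit))              # probability of the positive class

logits = torch.tensor([2.0, 0.5, -1.0])  # outputs of a 3-class classifier
print(torch.softmax(logits, dim=-1))     # probability distribution over the classes
```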

Representation learning

Textual data needs to be transformed into a suitable format for processing. This is where representation learning is used. For example, we can transform the word “Hello” into a vector of floating-point values and feed this to the model. This vector is called an embedding and, depending on the representation learning algorithm, it has nice properties.
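A toy sketch of such a lookup (the vocabulary and embedding size are made up, and the vectors here are randomly initialized rather than learned):

```python
import torch
import torch.nn as nn

vocab = {"Hello": 0, "world": 1, "!": 2}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

ids = torch.tensor([vocab["Hello"]])
print(embedding(ids))  # a 1x4 matrix of floating-point values
```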

3. Neural Machine Translation

Contrary to statistical phrase-based translation systems, neural machine translation uses neural networks to increase the translation performance, hence the word neural in its name. But not all neural networks are suitable for such tasks. One particular type of neural network used in neural machine translation is the Recurrent Neural Network (RNN) and its improved variants, like the Long Short-Term Memory (LSTM) [1] and the Gated Recurrent Unit (GRU) [2].

RNN vs. LSTM vs. GRU

These stateful networks were designed for handling sequential data by updating their hidden state whenever new information passes through them.

Model architecture

Most of the models used for neural machine translation have the encoder-decoder architecture. [3] Both the encoder and the decoder in most of these architectures are RNNs. The reason for using two RNNs (encoder and decoder) instead of one is that a single RNN cannot handle problems where the input and output sequences have different lengths and non-monotonic relationships. Furthermore, LSTMs are used instead of plain RNNs because they can capture long-range temporal dependencies. Both components are then jointly trained.

Encoder-Decoder architecture

What these models do is the following. For a given variable-length input sequence, the encoder produces a fixed-length vector which is then used by the decoder to produce a variable-length output sequence. Usually, the last hidden state from the encoder is used as a fixed-dimensional representation of the input sequence.

The drawback of this approach is that all of the input information needs to be compressed in a single vector, which is difficult to achieve for long input sequences, especially for sequences longer than the ones present in the training corpus.

So why not have the encoder produce multiple vectors instead of one, and let the decoder adaptively choose a subset of these vectors while decoding the sequence?

4. Attention

The attention mechanism was first presented in [4]; translation with it works as follows.

Encoder-Decoder architecture with Attention

At the encoder, the input sentence is split into tokens and each token is embedded.

Then, the list of embedding vectors is passed through a Bidirectional RNN (BiRNN) encoder. The BiRNN is composed of two RNNs operating separately: the forward RNN encodes the list in the original order, and the backward RNN encodes the list in reverse order.

Finally, the forward RNN’s hidden state and the backward RNN’s hidden state for each input vector are concatenated producing the so-called annotations.

These annotations are passed to the decoder. Here we see the difference between the original approach, of passing a single vector, and the improved one, of passing multiple vectors. The representational power of the latter is higher.

At the decoder, the output sentence is also split into tokens and each token is embedded.

Then, for each embedding vector, the decoder produces its context vector.

Here comes the attention.

The context vector is a weighted sum of the annotations, where the weights are produced by applying the softmax function to the outputs of the so-called alignment model. The alignment model is a feed-forward neural network. It scores how well each annotation matches the decoder’s previous hidden state.

Finally, both the context vector and the previous output vector are passed through an RNN decoder for predicting the next output token.
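A minimal sketch of this attention step (toy dimensions and randomly initialized weights; in [4] these parameters are learned jointly with the rest of the network):

```python
# The alignment model scores each annotation against the decoder's previous
# hidden state, softmax turns the scores into weights, and the context vector
# is the weighted sum of the annotations.
import torch
import torch.nn as nn

T, enc_dim, dec_dim, attn_dim = 5, 8, 8, 16   # toy sequence length and hidden sizes
annotations = torch.randn(T, enc_dim)         # one annotation per input token
prev_hidden = torch.randn(dec_dim)            # decoder's previous hidden state

W_a = nn.Linear(enc_dim, attn_dim, bias=False)
U_a = nn.Linear(dec_dim, attn_dim, bias=False)
v_a = nn.Linear(attn_dim, 1, bias=False)      # together: the feed-forward alignment model

scores = v_a(torch.tanh(W_a(annotations) + U_a(prev_hidden))).squeeze(-1)  # shape (T,)
weights = torch.softmax(scores, dim=0)        # attention weights over the annotations
context = weights @ annotations               # context vector, shape (enc_dim,)
```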

5. The Transformer

As a neural machine translation model, the transformer [5] follows the same encoder-decoder architecture, but it replaces the RNNs with quite different building blocks.

The Transformer architecture

The encoder and decoder are both composed of several layers.

Each layer in the encoder has two sub-layers: 1) multi-head self-attention, and 2) a point-wise fully connected feed-forward neural network. On top of each sub-layer, there is a layer normalization.

The decoder layers are similar, with two additions: 1) encoder-decoder attention, which sits between the two sub-layers mentioned previously, and 2) the multi-head self-attention is masked.

The difference between the transformer and the previous approach is that it completely relies on the attention mechanism and it can process all of the input tokens in parallel, something that RNNs fail to achieve. The way RNNs operate is by looking at their previous hidden state and the current input to produce the next output.

First, I will explain each of the sub-layers and then give the data flow in this architecture.

Layers

Let’s look at the sub-layers shared between both the encoder and the decoder.

Embeddings
The input to this layer is a list of tokens (e.g., [“Hello”, “world”, “!”]).

For each token, the appropriate embedding vector is retrieved from a lookup table.

The output of this layer is a matrix, X, of stacked embedding vectors.

Positional encoding
The input to this layer is a matrix, X, of stacked embedding vectors.

A matrix, containing positional information, is added to X.

Positional encoding matrix

For the positional information, sine (for the even dimensions) and cosine (for the odd dimensions) functions of different frequencies are used.

The output of this layer is a matrix obtained from the summation.

Since the model has no recurrence, for it to make use of the order of the sequence, the positional information is injected in this way.
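A sketch of this sinusoidal encoding (the sequence length and model size are toy values):

```python
# Sinusoidal positional encoding: sine for even dimensions, cosine for odd
# dimensions, with frequencies decreasing along the embedding dimension.
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

X = torch.randn(3, 16)                 # stacked embeddings for ["Hello", "world", "!"]
X = X + positional_encoding(3, 16)     # the order information is injected by summation
```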

Multi-head self-attention
The input to this sub-layer is a matrix, X, from the layer below.

Initially, the matrix X is composed of stacked embedding vectors with added positional encoding.

The input matrix, X, is multiplied with three matrices producing the Q, K, and V matrices, the queries, keys, and values respectively.

The Q and K matrices are then: 1) multiplied to get the attention scores, 2) scaled by the square root of the key dimension to stabilize the gradients, and 3) the resulting matrix is passed through a softmax function (applied row-wise). The output probabilities of the softmax are used to weight the V matrix, resulting in the output matrix, Z. This is self-attention; the name comes from the fact that each token is allowed to attend to the other tokens in the same sentence.

It is reasonable to think that one token will attend to itself more than to others. That’s where multi-head comes into play. The Q, K, and V matrices are projected into h different subspaces using different sets of WQ, WK, and WV matrices. Then, self-attention is applied to each subspace. The obtained h matrices are then concatenated and multiplied with yet another matrix, WO, that produces the final output matrix, Z. The last multiplication is needed to preserve the output dimensionality.

The output of this layer is the resulting matrix Z.
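A minimal sketch of this computation for a single head (toy dimensions; the random projection matrices stand in for the learned parameters):

```python
# Scaled dot-product self-attention for one head.
import math
import torch

n, d_model, d_k = 3, 16, 16                  # 3 tokens, toy dimensions
X = torch.randn(n, d_model)                  # stacked, position-encoded embeddings
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # queries, keys, values
scores = Q @ K.T / math.sqrt(d_k)            # attention scores, scaled for stable gradients
weights = torch.softmax(scores, dim=-1)      # row-wise: how much each token attends to the others
Z = weights @ V                              # the output matrix Z

# Multi-head attention repeats this with h different projection sets,
# concatenates the h outputs, and multiplies by W_O to restore d_model.
```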

There are four reasons for using self-attention layers as opposed to the recurrent (or even convolutional) layers: 1) Total computational complexity per layer — Self-attention layers are faster than recurrent layers when the sequence length is smaller than the representation dimensionality (which is most often the case),
2) Amount of computation that can be parallelized,
3) Path length between long-range dependencies in the network — The self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations, and
4) Interpretability — Not only do individual attention heads clearly learn to perform different tasks, but many also appear to exhibit behavior related to the syntactic and semantic structure of sentences.

Furthermore, this can be thought of as the parsing step for analyzing the hierarchical structure of the sentence.

Point-wise fully connected feed-forward neural network
The input to this sub-layer is a matrix, Z, from the sub-layer below.

It transforms the input matrix, Z, using two linear transformations with a ReLU activation in between. The transformation is applied position-wise, meaning, to each vector separately and identically.

The output of this layer is the transformed matrix Z.

This enables the model to generate compositions between sentence constituents.
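A sketch of this sub-layer (the dimensions d_model=512 and d_ff=2048 follow the base configuration in [5]):

```python
# Two linear transformations with a ReLU in between, applied to each
# position (row) separately and identically.
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

Z = torch.randn(3, d_model)   # one row per token
Z = ffn(Z)                    # the same weights are applied to every row
```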

Layer normalization
The inputs to this sub-layer are two matrices: X, the input matrix to the sub-layer below, and Z, the output matrix from the sub-layer below.

It is implemented as a residual connection: the two matrices are summed and the result is layer-normalized.

The output of this layer is a matrix obtained from the normalized summation.

This helps by improving the training speed and stability.

Let’s look at the sub-layers specific to the decoder.

Masked multi-head self-attention
The only difference between the masked multi-head self-attention and the multi-head self-attention is that a mask matrix, M, whose entries above the main diagonal are -inf (and zero elsewhere), is added to the attention scores.

This prevents the decoder from attending to future tokens.
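A small sketch of how such a mask can be built and applied (toy scores):

```python
# Entries above the main diagonal are -inf, so after the softmax each
# token's attention weights on future tokens become exactly zero.
import torch

n = 4                                         # number of decoder tokens
scores = torch.randn(n, n)                    # unmasked attention scores
mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)                                # the upper triangle is zero
```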

Encoder-decoder attention
The only difference between the encoder-decoder attention and the multi-head self-attention is that the K and V matrices are computed from the output of the encoder stack, while the Q matrix comes from the decoder layer below.

This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models.

Linear layer and Softmax
The input to this layer is the matrix Z from the final layer of the decoder; the layer outputs a probability distribution over the output vocabulary.

Data Flow

I will go over this part with an example.

Let’s consider the sentence “Hello World!” in English and we would like to use the transformer to translate it into French. Assume that the transformer was already trained to translate text from English to French.

First, the sentence is split into tokens: [“Hello”, “World”, “!”].

Second, each of the tokens is embedded and the positional encodings are added to the vectors.

Third, the stacked vectors are passed through each layer of the encoder.

This completes the encoder phase.

The output of the last encoder layer is used to compute the K and V matrices in each encoder-decoder attention sub-layer of the decoder.

A special token, [BOS] (beginning of sequence), is passed through the decoder in the same manner as the input was passed through the encoder: first, the token is embedded and the positional encoding is added to the vector; second, the result is passed through each layer of the decoder.

The decoder will output the word “Bonjour”.

Next, we take both tokens, [[BOS], “Bonjour”], and repeat the process at the decoder again. This will output the word “le”. Here we see why we do not want to attend to future tokens (because we have not translated them yet).

We do this until the decoder outputs a special token, [EOS] (end of sequence), meaning the sequence was translated successfully.
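Putting the inference loop together, a sketch of greedy autoregressive decoding could look as follows. The `model` object, its `encode`/`decode` methods, and the token ids are hypothetical placeholders that mirror the steps above, not an existing API:

```python
# Greedy decoding: run the encoder once, then repeatedly feed the tokens
# generated so far to the decoder and append the most probable next token.
def greedy_translate(model, source_tokens, bos_id, eos_id, max_len=50):
    memory = model.encode(source_tokens)          # encoder phase, done once
    output = [bos_id]                             # start with [BOS]
    for _ in range(max_len):
        logits = model.decode(output, memory)     # decoder sees all tokens so far
        next_id = int(logits[-1].argmax())        # most probable next token
        output.append(next_id)
        if next_id == eos_id:                     # stop at [EOS]
            break
    return output[1:]                             # drop [BOS]
```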

If the transformer has not been trained, its output will be an essentially random word, and we need to back-propagate and update the parameters of the model for each sequence pair in our training set. To accelerate the process, we can use batching: instead of one sentence, we translate multiple sentences at a time. However, for this to work, the lengths of the sequences need to match, which is why we add special [PAD] tokens to pad the shorter sequences.

6. Tasks

Here, I give a classification of the models based on the part of the transformer they are using as well as the task they are trying to solve.

Autoregressive models

  • These models use only the decoder.
  • They are used for text generation.
  • Models: GPT

Autoencoding models

  • These models use only the encoder.
  • They are used for sentence classification.
  • Models: BERT, RoBERTa, DistilBERT

Sequence-to-sequence models

  • These models use both the encoder and the decoder.
  • They are used for translation, summarization, or question answering.
  • Models: BART, mBART

The original transformer is a sequence-to-sequence model.

There are also multimodal and retrieval-based models; however, the three types above are the most commonly used.

Timeline

06/2018 — GPT
05/2019 — BERT
07/2019 — RoBERTa
10/2019 — BART
01/2020 — mBART
03/2020 — DistilBERT

7. Models

GPT — Improving Language Understanding by Generative Pre-Training

GPT was developed by OpenAI and published in June 2018. [6] It was designed to pre-train a language model. It uses a multi-layer Transformer decoder.

The framework has two steps: 1) pre-training and 2) fine-tuning. For pre-training, it uses a standard language modeling objective. For fine-tuning, the model is initialized with the pre-trained parameters and tuned on different tasks.

GPT is effective when fine-tuned for text classification, entailment determination, semantic similarity assessment, and question answering.

There is one model size with 12 layers, a size of 768, and 12 attention heads.

The differences between GPT and the original Transformer decoder are: 1) it uses GELU activation functions instead of ReLU and 2) it uses learned positional embeddings instead of sinusoidal ones.

Input

For pre-training, the input to the model is a list of tokens over a context window from an unsupervised corpus. For fine-tuning, the input is transformed depending on the task (explained below).

Three special tokens are used: 1) [START] added at the beginning of the sequence, 2) [SEP] added between the tokens of the two sentences, and 3) [END] added at the end of the sequence.

Ex: [“Hello World!”, “How are you?”] are converted to: [[START], “Hello”, “World”, “!”, [SEP], “How”, “are”, “you”, “?”, [END]]

Then, tokens are embedded and learned positional encoding is added.

Pre-training

GPT was pre-trained using BookCorpus.

Language Model

GPT predicts the next word by using a left-to-right model.

A list of tokens over a context window is passed through the model, and the final transformer block’s activations are passed through a softmax function to obtain a probability distribution over the vocabulary.
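As a usage sketch with the Hugging Face transformers library (assuming the "openai-gpt" checkpoint can be downloaded), generation proceeds left to right, one token at a time:

```python
# Greedy text generation with the pre-trained GPT checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-gpt")
model = AutoModelForCausalLM.from_pretrained("openai-gpt")

inputs = tokenizer("Hello world", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=20)   # appends one predicted token at a time
print(tokenizer.decode(output_ids[0]))
```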

Fine-tuning

Every task requires only one linear layer on top of GPT. The input to this layer is the final transformer block’s activation. The output of this layer is usually normalized using a softmax function and the class with the highest probability is used.

During fine-tuning, language modeling is included as an auxiliary objective; it improves generalization and accelerates convergence.

Entailment determination
For this task, the premise, P, and hypothesis, H, are concatenated:
[[START], P, [SEP], H, [END]],
the sequence is processed by the model, and the model’s output activations are passed to the linear layer.

Semantic similarity assessment
For this task, the two sentences, A and B, are concatenated in both orders:
[[START], A, [SEP], B, [END]] and
[[START], B, [SEP], A, [END]],
each sequence is processed independently by the model and the model’s output activations are summed before passing them to the linear layer.

Question answering
For this task, the context document, C, and each possible answer, Ai, are concatenated:
[[START], C, [SEP], A1, [END]],
[[START], C, [SEP], A2, [END]], and
[[START], C, [SEP], A3, [END]],
each sequence is processed independently by the model and the model’s output activations are passed to the linear layer. The outputs of the linear layer are normalized to produce a distribution over the answers.

BERT — Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT was developed by Google AI and published in May 2019. [7] It was designed to pre-train deep bidirectional representations. It uses a bidirectional Transformer encoder.

The framework has two steps: 1) pre-training and 2) fine-tuning. For pre-training, it uses a masked language model objective and the next sentence prediction task. For fine-tuning, the model is initialized with the pre-trained parameters and tuned on different tasks.

BERT is effective when fine-tuned for question answering and language inference.

There are two model sizes: 1) BERT Base (for comparison with GPT) and 2) BERT Large. BERT Base has 12 layers, a size of 768, and 12 attention heads — 110M parameters. BERT Large has 24 layers, a size of 1024, and 16 attention heads — 340M parameters.

Input

The input to the model is a list of tokens from one or two sentences, A and B, concatenated together.

Two special tokens are used: 1) [CLS] added at the beginning of the sequence and its final hidden state corresponds to an aggregated sequence representation used for classification and 2) [SEP] added between the tokens of the two sentences.

Ex: [“Hello World!”, “How are you?”] are converted to: [[CLS], “Hello”, “World”, “!”, [SEP], “How”, “are”, “you”, “?”]

Then, the tokens are embedded using WordPiece embeddings. In addition to the positional embedding, a learned segment embedding indicating whether each token belongs to sentence A or sentence B is added.
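A small sketch of this input formatting with the Hugging Face tokenizer for "bert-base-uncased" (assumed to be available); passing a sentence pair adds the special tokens and the A/B segment ids automatically:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Hello World!", "How are you?")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'hello', 'world', '!', '[SEP]', 'how', 'are', 'you', '?', '[SEP]']
print(encoded["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens
```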

Pre-training

BERT was pre-trained using BooksCorpus and English Wikipedia, using only the text passages from the latter.

Masked Language Model
Instead of predicting the next word using a left-to-right or right-to-left model (or even a shallow concatenation of the two), BERT does so by using a Cloze procedure or masked language model (MLM). In this way, the model that is obtained is bidirectional.

Masking is done by selecting 15% of the input tokens and then: 1) replacing them with the [MASK] token 80% of the time, 2) replacing them with a random token 10% of the time, or 3) leaving them unchanged 10% of the time. The reason for not always using [MASK] is that this token never appears during fine-tuning; mixing in random and unchanged tokens mitigates the mismatch between pre-training and fine-tuning.

The final hidden vectors for the [MASK] tokens are passed through a feed-forward neural network + softmax function to obtain a probability distribution over the vocabulary.
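A sketch of this 80%/10%/10% recipe (the [MASK] token id, the vocabulary size, and the -100 ignore label follow common PyTorch conventions and are assumptions, not details taken from the paper; real implementations also avoid masking special tokens):

```python
import torch

def mask_tokens(input_ids, mask_id=103, vocab_size=30522, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob                   # choose 15% of positions
    labels[~selected] = -100                                            # only selected positions are predicted

    masked = selected & (torch.rand(input_ids.shape) < 0.8)             # 80% -> [MASK]
    random_ = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random token
    input_ids = input_ids.clone()                                       # the remaining 10% stay unchanged
    input_ids[masked] = mask_id
    input_ids[random_] = torch.randint(vocab_size, (int(random_.sum()),))
    return input_ids, labels
```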

Next Sentence Prediction
Question answering and natural language inference depend on understanding the relationship between two sentences. Because of this, BERT is pre-trained for a binarized next sentence prediction task.

The training data is created by taking two sentences, A and B, from a monolingual corpus such that: 1) sentence B comes after sentence A 50% of the time and 2) sentence B is a random sentence 50% of the time.

The representation for the [CLS] token is used for classification.

Fine-tuning

BERT was fine-tuned on 11 NLP tasks producing state-of-the-art results on all of them.

Question answering
For this task, the question Q, and the passage, P, are concatenated:
[[CLS], Q, [SEP], P],
the sequence is processed by the model, and the model’s output activations are passed to two linear layers. The first linear layer predicts the start token and the second linear layer predicts the end token of the answer span in the passage. The outputs of the linear layers are normalized using a softmax function and the start and end tokens with the highest probabilities are used. Their respective linear layer outputs are summed, giving the score for the predicted span.

For the case where the passage does not contain an answer, two scores are obtained: 1) for the non-null span (the same as above) and 2) for the null span (the [CLS] token is passed through both linear layers and the outputs are summed). The model predicts a non-null span if its score is higher than the null span + threshold (a hyperparameter, chosen using the validation set).

Language understanding
For this task, the two sentences, A and B, are concatenated:
[[CLS], A, [SEP], B],
the sequence is processed by the model, and the representation for the [CLS] token is passed to a linear layer. The output of the linear layer is normalized using a softmax function and the class with the highest probability is used.

Common sense inference
For this task, the sentence, S, and the possible continuation, Ci, are concatenated:
[[CLS], S, [SEP], C1],
[[CLS], S, [SEP], C2], and
[[CLS], S, [SEP], C3],
each sequence is processed independently by the model and the representation for the [CLS] token is passed to a linear layer. The results are normalized to produce a distribution over the continuations.

RoBERTa — A Robustly Optimized BERT Pre-training Approach

RoBERTa was developed by Facebook AI and published in July 2019. [8] It is a replication study of BERT pre-training that carefully measures the impact of many key hyperparameters and training data size.

The modifications include: 1) training the model longer, with bigger batches, over more data, 2) removing the next sentence prediction objective, 3) training on longer sequences, and 4) dynamically changing the masking pattern applied to the training data.

The contributions of RoBERTa are: 1) a set of important BERT design choices and training strategies and alternatives that lead to better downstream task performance, 2) usage of a novel dataset which confirms that using more data for pre-training further improves performance on down-stream tasks, 3) training improvements which show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods.

The methods below were explored using a re-implementation of the BERT Base configuration.

Training Procedure Analysis

Static vs. Dynamic Masking
BERT performed masking during preprocessing: each training instance was duplicated 10 times and masked in 10 different ways. RoBERTa uses dynamic masking: the masking pattern is generated each time a sequence is fed into the model. Better results and additional efficiency benefits encourage the usage of dynamic masking.

Model Input Format and Next Sentence Prediction
The original BERT input is a concatenation of two document segments (sampled contiguously from the same document or from distinct documents). The model was trained to predict whether the observed segments come from the same or distinct documents, via the Next Sentence Prediction (NSP) loss. RoBERTa compared several training formats: 1) Segment-Pair + NSP, 2) Sentence-Pair + NSP, 3) Full-Sentences, and 4) Doc-Sentences. The results showed an improvement when using the Full-Sentences format without the NSP loss.

Training with Large Batches
Using large mini-batches can improve both optimization speed and end-task performance when the learning rate is increased appropriately. BERT Base was trained for 1M steps with a batch size of 256 sequences. This is roughly equivalent, in terms of the amount of data seen, to training for 125K steps with a batch size of 2K sequences, or for 31K steps with a batch size of 8K sequences. RoBERTa uses batches of 8K sequences.

Text Encoding
BERT uses a character-level Byte-Pair Encoding (BPE) vocabulary of 30K units, whereas RoBERTa uses a larger, byte-level BPE vocabulary of 50K subword units, without any additional preprocessing or tokenization of the input.

Data

Having large quantities of text for pre-training improves end-task performance. For RoBERTa, five English-language corpora were considered, totaling about 160GB of uncompressed text: 1) BookCorpus + English Wikipedia (16GB), 2) CC-News, the English portion of the CommonCrawl News dataset (76GB), 3) OpenWebText, an open-source recreation of WebText (38GB), and 4) Stories, a CommonCrawl subset that matches the story-like style of Winograd schemas (31GB).

The above modifications to the BERT pre-training procedure were aggregated and their combined impact was evaluated. This new configuration follows the BERT Large architecture.

DistilBERT — A Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter

DistilBERT was developed by Hugging Face and published in March 2020. [9] It was designed to pre-train a smaller general-purpose language representation model.

DistilBERT leverages knowledge distillation during pre-training and shows that it is possible to reduce the size of BERT by 40% while retaining 97% of its language understanding capabilities and being 60% faster.

The trend toward bigger models raised several concerns: 1) environmental cost of exponentially scaling the computational requirements to train such models and 2) the computational and memory requirements could hamper on-device real-time usage.

Knowledge Distillation

Knowledge distillation is a compression technique in which a compact model (the student) is trained to reproduce the behavior of a larger model (the teacher).

In supervised learning, a classification model’s training objective minimizes the cross-entropy between the model’s predicted distribution and the one-hot empirical distribution of the training labels. With knowledge distillation, the one-hot empirical distribution is replaced by the soft target probabilities of the teacher. Furthermore, instead of the standard softmax, a softmax with temperature (the logits are divided by a constant, T) is used during training to smooth the output distributions; at inference, T is set back to 1. The resulting term is called the distillation loss.

DistilBERT’s training objective is a linear combination of the distillation loss, the masked language modeling loss, and a cosine embedding loss (used to align the directions of the student and teacher hidden states).
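A sketch of the distillation term of this objective (the logits are random placeholders; in practice they come from the teacher and the student, and the temperature T is a hyperparameter):

```python
# KL divergence between the student's and the teacher's temperature-softened
# output distributions.
import torch
import torch.nn.functional as F

T = 2.0                                  # temperature used during training
teacher_logits = torch.randn(8, 30522)   # (batch, vocab) placeholders
student_logits = torch.randn(8, 30522)

distill_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T                                # rescale to keep gradient magnitudes comparable

# The full DistilBERT objective linearly combines this with the masked language
# modeling loss and a cosine embedding loss on the hidden states.
```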

A distilled version of BERT

Student architecture
DistilBERT has the same general architecture as BERT; however, the number of layers is reduced by a factor of 2.

Student initialization
The student is initialized by taking every other layer from the teacher.

Distillation
The same pre-training practices from RoBERTa were used: 1) large batch sizes, 2) dynamic masking, and 3) removing the next sentence prediction objective.

Data
DistilBERT is trained using the same corpus as the original BERT model.

BART — Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

BART was developed by Facebook AI and published in October 2019. [10] It was designed to pre-train a sequence-to-sequence model. It uses a standard Transformer architecture (BERT — bidirectional encoder + GPT — left-to-right decoder).

The framework has two steps: 1) pre-training and 2) fine-tuning. For pre-training, it corrupts the text with an arbitrary noising function and learns to reconstruct the original text. For fine-tuning, the model is initialized with the pre-trained parameters and tuned on different tasks.

BART is effective when fine-tuned for text generation but also works well for comprehension tasks.

There are two model sizes: 1) BART Base and 2) BART Large. BART Base has 12 layers (6 in the encoder and 6 in the decoder), a size of 768, and 12 attention heads. BART Large has 24 layers (12 in the encoder and 12 in the decoder), a size of 1024, and 16 attention heads.

BART uses the standard sequence-to-sequence Transformer architecture; however, it uses GELU activation functions instead of ReLU (as GPT does).
Furthermore, the architecture is closely related to BERT, with the following differences: 1) each layer of the decoder additionally performs cross-attention over the final hidden layer of the encoder (as in the transformer sequence-to-sequence model), and 2) BART does not use a feed-forward neural network for word prediction (whereas BERT does).

Input

For pre-training, the input to the model is a noised version of the original input. For fine-tuning, the input depends on the task (explained below).

Pre-training

BART is trained by corrupting documents and then optimizing a reconstruction loss — the cross-entropy between the decoder’s output and the original document. Unlike existing denoising autoencoders, which are tailored to specific noising schemes, BART allows us to apply any type of document corruption. In the extreme case, where all information about the source is lost, BART is equivalent to a language model.

Token masking
Random tokens are replaced with [MASK] tokens.
Token deletion
Random tokens are deleted from the input sequence.
Token infilling
Random token spans are replaced with a single [MASK] token.
Sentence permutation
The document is divided into sentences (based on full stops) and these sentences are shuffled.
Document rotation
A random token is chosen and the document is rotated so that it begins with that token.

Fine-tuning

The representations produced by BART can be used in several ways for downstream applications.

Sequence classification tasks
For this task, the same input is fed into both the encoder and the decoder. Instead of using a [CLS] token for classification, BART adds an additional token at the end of the sequence, and this token’s hidden state from the final decoder layer is fed into a new multi-class linear classifier.

Token classification tasks
For this task, the same input is fed into both the encoder and the decoder. The top hidden state of the decoder is used as a representation for each word, and this representation is used to classify the token.

Sequence generation tasks
For this task, BART’s autoregressive decoder can be fine-tuned directly for sequence generation tasks like abstractive question answering and summarization. In both tasks, information is copied from the input but manipulated, which is closely related to the denoising pre-training objective. The encoder input is the input sequence and the decoder generates the output autoregressively.

Machine translation
For this task, BART uses additional transformer layers that are trained to translate a foreign language into noised English. These layers replace the encoder’s embedding layer and are trained in two steps: 1) most of BART’s parameters are frozen and only the source encoder, the positional embeddings, and the self-attention input projection matrix of the first encoder layer are updated, and 2) all model parameters are trained for a small number of iterations.

mBART — Multilingual Denoising Pre-training for Neural Machine Translation

mBART was developed by Facebook AI and published in January 2020. [11] It is a sequence-to-sequence model pre-trained on large-scale monolingual corpora in many languages using the BART objective.

The framework has two steps: 1) pre-training and 2) fine-tuning. For pre-training, a complete autoregressive model is trained with an objective that noises and reconstructs full texts across many languages. For fine-tuning, the model is initialized with the pre-trained parameters that can be used for any of the language pairs in both supervised and unsupervised settings.

Furthermore, mBART enables new types of transfer: 1) fine-tuning on one language pair creates a model that can translate from all other languages to the target language with no further training and 2) languages not in the pre-training corpora can benefit from mBART, suggesting that the initialization is partially language universal.

mBART is effective on a wide variety of machine translation (MT) tasks.

mBART uses the BART Large model size.

mBART employs only two types of noising functions: 1) removing spans of text and replacing them with a [MASK] token, and 2) permuting the order of sentences within each instance.

Input

A special token, [LID], denoting the language ID, is sampled. Then, consecutive sentences are sampled from the corresponding corpus until the document boundary or the maximum token length is reached. Sentences in the instance are separated by an end-of-sentence token, [S], and the [LID] token is appended to mark the end of the instance. Training at the multi-sentence level helps with both sentence-level and document-level translation.

For fine-tuning on the document-level MT task, the same pre-processing explained above is used.

Pre-training

mBART is pre-trained using 25 languages from the Common Crawl dataset. The corpus was rebalanced by up/down-sampling. Different levels of multilinguality were measured by training several models: 1) mBART25 — model pre-trained on all 25 languages, 2) mBART06 — model pre-trained on 6 European languages, 3) mBART02 — bilingual models pre-trained on English and one other language, 4) BART-En/Ro — baseline model.

Fine-tuning

mBART was fine-tuned on two main tasks: 1) sentence-level MT and 2) document-level MT.

For sentence-level MT, a multilingual pre-trained model is fine-tuned on a single pair of bi-text data, feeding the source language into the encoder and decoding the target language. For decoding, it uses beam search with a beam size of 5 for all directions.

For document-level MT, the same fine-tuning scheme as for sentence-level MT is used. For decoding, however, the source sentences are packed into blocks and each block is translated autoregressively. The model does not know in advance how many sentences to generate; decoding stops when [LID] is predicted. The beam size is 5 by default.

mBART was also evaluated on an unsupervised MT task, where no bi-text is available for the target language pair.

8. Conclusion

We’ve covered a lot of ground here. These models are really useful when used appropriately and you don’t need to have a cluster of GPUs to experiment with them or include them in your machine learning project.

I’ve decided to write this story because, on several occasions, I found myself interacting with these models without really knowing how they operate or how they relate to each other. This story is for everyone in a similar situation. I really hope that people will turn to it whenever they have difficulties using the Transformer. For future work, I plan to extend this with other publicly available models.

References

[1] Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9.8 (1997): 1735–1780.
[2] Cho, Kyunghyun, et al. “Learning phrase representations using RNN encoder-decoder for statistical machine translation.” arXiv preprint arXiv:1406.1078 (2014).
[3] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in neural information processing systems. 2014.
[4] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
[5] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.
[6] Radford, Alec, et al. “Improving language understanding by generative pre-training.” (2018).
[7] Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
[8] Liu, Yinhan, et al. “Roberta: A robustly optimized bert pretraining approach.” arXiv preprint arXiv:1907.11692 (2019).
[9] Sanh, Victor, et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” arXiv preprint arXiv:1910.01108 (2019).
[10] Lewis, Mike, et al. “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.” arXiv preprint arXiv:1910.13461 (2019).
[11] Liu, Yinhan, et al. “Multilingual denoising pre-training for neural machine translation.” Transactions of the Association for Computational Linguistics 8 (2020): 726–742.
