Paper Summary: Improving Language Understanding by Generative Pre-Training

Part of the series A Month of Machine Learning Paper Summaries. Originally posted on 2018/11/19.

Improving Language Understanding by Generative Pre-Training (2018) (https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

This paper is in some ways just the next logical step in the progression of recent NLP results, and in other ways it’s a technical marvel: it beat out the state of the art on 9 out of 12 varied NLP tasks. It builds on ULMFiT and the Transformer, so you may want to check out those summaries first. The observation driving this work is that while there have been many attempts to apply pre-training and fine-tuning to NLP, there is no clear leader and no consensus on how to adapt a pre-trained model to new tasks. Particularly problematic is the need to make architectural adjustments (as with CoVe and ELMo); this paper uses a simple pre-processing approach that fits structured tasks to the architecture rather than the other way around.

The approach roughly follows ULMFiT, though the training process has only two phases and is somewhat simplified. It also swaps out ULMFiT’s LSTM for a Transformer, which is better at capturing long-term dependencies. Note also that the Transformer used here is actually a Transformer decoder model: it contains only decoder layers, and since there is no encoder to attend to, each layer has a single multi-head attention sublayer (masked self-attention) followed by the feed-forward sublayer. The present work also uses an auxiliary LM objective in the fine-tuning phase, which (my interpretation) is a contributing factor to their ability to get away with two phases instead of ULMFiT’s three.

The first phase, unsupervised pre-training, is quite standard. The authors train a 12-layer Transformer decoder model with masked self-attention (using 768-d vectors and 12 attention heads, and 3072-d layers in the feed-forward blocks, so beefier than the original Transformer). The masking is necessary to avoid looking at the words to be predicted. The objective L1 maximizes the log likelihood of each next token given the preceding context, summed over the corpus; training used sequences of 512 contiguous tokens. They used BookCorpus, which contains long stretches of contiguous text and therefore long-range dependencies (this turns out to be important).
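To make the masking and the L1 objective concrete, here is a minimal pure-Python sketch. It is not the authors' code: the function names (causal_mask, masked_softmax, lm_log_likelihood) and the toy probabilities are my own assumptions, and a real model would compute the next-token distributions with the Transformer itself.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, giving zero weight to masked positions."""
    visible = [s for s, keep in zip(scores, mask_row) if keep]
    peak = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

def lm_log_likelihood(token_ids, next_token_probs):
    """L1 for one sequence: sum over i of log P(token i+1 | tokens 0..i).

    next_token_probs[i] is a (hypothetical) model output: the distribution
    over the vocabulary for the token that follows position i.
    """
    return sum(
        math.log(next_token_probs[i][token_ids[i + 1]])
        for i in range(len(token_ids) - 1)
    )

# Toy usage with a 2-word vocabulary and made-up probabilities.
mask = causal_mask(3)
weights = masked_softmax([1.0, 2.0, 3.0], mask[1])  # position 1 cannot see position 2
log_lik = lm_log_likelihood([0, 1, 1], [[0.5, 0.5], [0.25, 0.75]])
```

Position 1 places zero attention weight on position 2, which is exactly the token it is being asked to predict.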

The second phase, supervised fine-tuning, considers labeled examples, each consisting of a sequence of input tokens x1, …, xm and an output label y. The output is predicted by introducing a new weight matrix Wy, which is multiplied by the final Transformer block’s activation at the last token and fed into a softmax to produce output probabilities P(y). Again the objective (L2) is to maximize the sum of the log probabilities, though in fact the objective from the first phase is kept and used as an auxiliary objective. So the final objective is L3 = L2 + λL1 (λ = 0.5). The only new parameters introduced in this phase are Wy (plus embeddings for special delimiter tokens, which I’ll describe in a moment); the pre-trained weights are updated too, but the paper reports that about 3 epochs were enough for most tasks, so fine-tuning is relatively fast.
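As a rough illustration of how the classification head and the combined objective fit together, here is a small sketch for a single example. The helper names (softmax, classify, fine_tuning_objective) and the toy numbers are assumptions for illustration; only the formula L3 = L2 + λL1 with λ = 0.5 comes from the paper.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_last, W_y):
    """P(y | x) = softmax(h_last . W_y), where W_y has shape [hidden][num_classes]."""
    num_classes = len(W_y[0])
    logits = [
        sum(h * row[c] for h, row in zip(h_last, W_y))
        for c in range(num_classes)
    ]
    return softmax(logits)

def fine_tuning_objective(class_probs, label, lm_log_lik, lam=0.5):
    """L3 = L2 + lambda * L1 for one example (a quantity to maximize)."""
    l2 = math.log(class_probs[label])  # supervised log-probability of the true label
    return l2 + lam * lm_log_lik       # plus the weighted auxiliary LM term

# Toy usage: a 2-d "activation", 2 classes, and a made-up LM log-likelihood of -2.0.
probs = classify([1.0, 0.0], [[0.0, 0.0], [5.0, -5.0]])
l3 = fine_tuning_objective(probs, label=0, lm_log_lik=-2.0)
```

In practice one would minimize the negative of this quantity, averaged over a batch.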

So what about tasks whose inputs aren’t a single contiguous span of text? This paper handles them with “traversal-style” pre-processing (Rocktäschel et al 2015), which converts structured input (like multiple choice questions) into one token sequence by concatenating the input parts along with special <s>, <e>, and $ delimiters. Entailment is simple: it’s just <s> + premise + $ + hypothesis + <e>. Sentence similarity works similarly, but since the task has no inherent ordering, both orderings are run through the model and the two resulting representations are added element-wise before the output layer. Multiple choice is n model evaluations of the form <s> + context + $ + answer i + <e>, one per candidate answer, with the n scores normalized by a softmax. (Figure 1 of the paper illustrates these transformations.)
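The input transformations can be sketched as plain list concatenation. This is a toy version operating on word lists; the function names are hypothetical, and in the real model the delimiters are special tokens with their own learned embeddings rather than literal strings.

```python
START, DELIM, END = "<s>", "$", "<e>"

def pair_input(first, second):
    """<s> + first + $ + second + <e>, as one flat token list."""
    return [START] + first + [DELIM] + second + [END]

def entailment_input(premise, hypothesis):
    return pair_input(premise, hypothesis)

def similarity_inputs(sent_a, sent_b):
    """Both orderings; the model's two output representations are then summed."""
    return [pair_input(sent_a, sent_b), pair_input(sent_b, sent_a)]

def multiple_choice_inputs(context, answers):
    """One sequence per candidate answer; the per-sequence scores go through a softmax."""
    return [pair_input(context, answer) for answer in answers]

# Toy usage.
premise = ["a", "dog", "runs"]
hypothesis = ["an", "animal", "moves"]
entail_seq = entailment_input(premise, hypothesis)
sim_seqs = similarity_inputs(["hello"], ["hi"])
mc_seqs = multiple_choice_inputs(["the", "sky", "is"], [["blue"], ["green"]])
```

Because every task ends up as one or more flat token sequences, the same Transformer can process all of them without task-specific architecture changes.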

Some notes on setup. They used byte pair encoding, or BPE (Sennrich et al 2015), for the vocabulary, which is a way to extract subword information (compare to fastText, and also ELMo’s character-level convolutions). They also used GELU (Hendrycks and Gimpel 2016) for activations instead of ReLU, and Loshchilov and Hutter 2017’s weight decay fix for Adam. Another departure from the Transformer paper was the use of learned positional encodings rather than the sinusoidal ones (too bad, I thought those were pretty cool).
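For readers who haven't seen byte pair encoding, here is a minimal sketch of how the merges are learned: start from characters and repeatedly merge the most frequent adjacent pair of symbols. The function name learn_bpe, the "</w>" end-of-word marker, and the toy word counts are illustrative choices in the spirit of Sennrich et al, not the tokenizer actually shipped with the model.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dictionary.

    Returns the ordered list of merged pairs and the final segmented vocabulary.
    """
    # Represent each word as a tuple of symbols, ending with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Toy usage: the frequent suffix "est" becomes a single subword unit.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

After three merges on this toy corpus, "newest" and "widest" share the subword "est</w>", which is the sense in which BPE captures subword information.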

In ablation tests the authors found that sometimes the auxiliary LM objective helped (especially with larger datasets), and sometimes it didn’t. They also investigated zero-shot learning (!), meaning the pre-trained model was able to perform better than random guessing without any labeled data, and this effect improved with increasing levels of pre-training. The coolest zero-shot result was to extract sentiment by appending the token “very” to the input and then restricting the output to the words “positive” and “negative”; this resulted in a respectable ~80% accuracy.
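The zero-shot sentiment trick is simple enough to sketch. Everything here is illustrative: next_token_logprobs stands in for whatever interface the pre-trained model exposes, and toy_model is a fake stand-in so the snippet runs on its own.

```python
def zero_shot_sentiment(review_tokens, next_token_logprobs):
    """Append "very", then compare the LM's scores for "positive" vs "negative".

    next_token_logprobs is a hypothetical callable: token list -> {token: log-prob}
    for the next token. Words missing from the dict are treated as impossible.
    """
    prompt = review_tokens + ["very"]
    logprobs = next_token_logprobs(prompt)
    candidates = ("positive", "negative")
    return max(candidates, key=lambda word: logprobs.get(word, float("-inf")))

def toy_model(tokens):
    """Fake language model for demonstration only; not a real pre-trained LM."""
    if "loved" in tokens:
        return {"positive": -0.5, "negative": -3.0, "good": -1.0}
    return {"positive": -2.5, "negative": -0.7, "bad": -1.2}

label_a = zero_shot_sentiment(["i", "loved", "this", "film"], toy_model)
label_b = zero_shot_sentiment(["what", "a", "waste"], toy_model)
```

Note that other words ("good", "bad") may well be more likely than either candidate; restricting the comparison to the two label words is what turns the language model into a classifier.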

One problem with this approach is the compute required. The pre-training phase took 1 month to train on 8 GPUs, according to the blog post. Fortunately the pre-trained model is available for download.


Hendrycks and Gimpel 2016 “Gaussian Error Linear Units (GELUs)” https://arxiv.org/abs/1606.08415

Loshchilov and Hutter 2017 “Fixing Weight Decay Regularization in Adam” https://arxiv.org/abs/1711.05101

Rocktäschel et al 2015 “Reasoning about Entailment with Neural Attention” https://arxiv.org/abs/1509.06664

Sennrich et al 2015 “Neural Machine Translation of Rare Words with Subword Units” https://arxiv.org/abs/1508.07909