Paper Summary: BERT

Mike Plotz
4 min read · Nov 21, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/20.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

This brings us up to the state of the art. As of last month anyway, which in this field might be ancient history. It seems to me that the folks at Google looked at the recent OpenAI result and were not impressed. Or more likely they were already working on a pre-trained Transformer-based architecture and were annoyed at getting scooped. In any case, this paper is clearly a response to (and smackdown of) this summer’s Generative Pre-trained Transformer (GPT) paper from Radford et al. It’s also a followup to ULMFiT and ELMo. The main difference here is finding a way to use bidirectional information: at all, compared to GPT and ULMFiT; and in a deep, jointly trained way, compared to ELMo.

BERT follows the GPT architecture closely, going as far as to copy many of the dimensions and other hyperparameters of the Transformer, so as to be easily comparable. Or at least they do that for their base model; they also, in a we-have-more-and-better-hardware-than-you kind of move, train a large model with 3x as many (340M!) parameters. (They also pre-train on more data — English Wikipedia as well as BookCorpus — along with some other minor differences.) The base model has 12 Transformer layers, hidden dimension of 768, and 12 attention heads; the large model has 24 layers, dim 1024, and 16 heads. The input is also handled a bit differently:

  • They use WordPiece embeddings (with split words: playing → play + ##ing)
  • They include the special tokens [CLS] and [SEP] in the pre-training input (GPT had to learn its delimiter tokens during fine-tuning)
  • They split the sentences into an A and B segment (more on this below), and add a segment embedding to the token and position embeddings before the first Transformer layer (a rough sketch of this sum follows the list)
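Putting those pieces together, the input representation is just the sum of three learned embeddings. Here’s a minimal PyTorch sketch with base-model sizes; the class and argument names are mine, not the authors’ code:

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sum of token, segment, and position embeddings, fed to the first
    Transformer layer. A sketch with base-model sizes, not the paper's code."""
    def __init__(self, vocab_size=30000, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # WordPiece ids (play, ##ing, ...)
        self.seg = nn.Embedding(n_segments, hidden)   # 0 for segment A, 1 for segment B
        self.pos = nn.Embedding(max_len, hidden)      # learned positions

    def forward(self, token_ids, segment_ids):        # both (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
```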

So how can this bidirectional Transformer possibly work? If you allow information about future tokens to leak into the network, predicting the future becomes trivial and the network won’t learn anything useful. And combining left-to-right and right-to-left information in a deep way (as opposed to a shallow ensemble) seems to necessitate this kind of leakage.

The authors’ answer is a masked language model (MLM) that randomly masks out 15% of the words in the input and predicts just those masked words. Contextual information flows in both directions. There is a worry that the MLM might be less data efficient, and indeed it converges a bit slower than a unidirectional model… but this turns out not to be a problem since it quickly outperforms the unidirectional model in terms of accuracy. The architecture now looks more like the “Transformer encoder” than the decoder used by GPT.
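Concretely, only the masked positions contribute to the training loss. A minimal sketch of that objective (the -100 label convention is a common one I’m assuming here, not something from the paper):

```python
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """Cross-entropy over masked positions only.
    logits: (batch, seq_len, vocab) from the bidirectional encoder
    labels: (batch, seq_len) holding the original token id at masked
            positions and -100 everywhere else (ignored by the loss)."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1),
                           ignore_index=-100)
```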

The details of how words are masked are worth going into. As noted, 15% of words are selected for prediction, but only 80% of those are actually replaced with a [MASK] token. Of the remaining 20%, half are replaced with a random word and half are left intact. In practice the model has to try to validate (and therefore build a contextual representation of) every word, since it doesn’t “know” which 1.5% (10% of 15%) of words have been swapped out or which 15% of words it will be asked to predict. This is extremely clever and it clearly works well, yet somehow it feels like cheating, or at least getting away with something.
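Here’s roughly what that masking recipe looks like in code; this is a sketch of the procedure as described, with made-up names, not the authors’ implementation:

```python
import random

def mask_for_mlm(tokens, vocab, select_prob=0.15):
    """Select ~15% of tokens to predict; of those, 80% become [MASK],
    10% become a random token, 10% stay unchanged."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= select_prob:
            continue
        targets[i] = tok                          # the model must predict this
        roll = random.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = random.choice(vocab)   # swapped for a random word
        # else: token left intact, but still predicted
    return corrupted, targets
```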

The authors added a second pre-training task: next sentence prediction. The idea is similar to Skip-Thought or Quick-Thought vectors and is meant to encourage learning relationships between sentences. It works like this: for consecutive sentences A and B in the input, half the time replace B with a random sentence from the corpus; otherwise leave B as is. The output is a binary classifier that predicts whether the sentence was swapped out (NotNext) or not (IsNext). The pre-trained model ends up with 97–98% accuracy on this task, which sounds pretty good to me (I’m not sure I could do better).
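The data construction for this task is simple enough to sketch (function name and structure are mine):

```python
import random

def make_nsp_pair(sentences, i, corpus):
    """Return (segment A, segment B, label): half the time B is the true
    next sentence (IsNext), half the time a random one (NotNext)."""
    a = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        return a, sentences[i + 1], "IsNext"
    return a, random.choice(corpus), "NotNext"
```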

Pre-training took 4 days for each model: 4 cloud TPUs (16 cores) for the base model and 16 cloud TPUs (64 cores) for the large one. (Comparing the base model to the 8-GPUs-for-1-month figure from OpenAI’s GPT, that’s a 7.5x wall-clock speedup, which works out to each TPU core being almost 4x as fast as a GPU.)
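The back-of-the-envelope math for the base model, in case you want to check it:

```python
gpt_gpu_days     = 8 * 30      # GPT: 8 GPUs for roughly a month
bert_core_days   = 16 * 4      # BERT base: 16 TPU cores for 4 days
wall_clock_ratio = 30 / 4                         # 7.5x faster end to end
per_core_ratio   = gpt_gpu_days / bert_core_days  # ~3.75x per TPU core vs GPU
```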

The fine-tuning step also has some small differences compared to GPT. For classification they attach a softmax classification head to the output location corresponding to the special [CLS] token at the start of each sentence pair and learn the weights of that new layer along with the rest of the network. Question answering and sequence tagging (e.g. NER) are treated as token-level tasks, so they attach a new output layer at every token position. This is kind of awkward, in my opinion; I prefer GPT’s generative approach, which strikes me as more flexible.
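For the classification case, the only new piece is a linear layer over the [CLS] output; a hedged sketch (names and the two-class default are my choices):

```python
import torch.nn as nn

class ClsHead(nn.Module):
    """Softmax classifier over the final hidden state at the [CLS] position."""
    def __init__(self, hidden=768, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, encoder_out):        # (batch, seq_len, hidden)
        return self.fc(encoder_out[:, 0])  # [CLS] is position 0 -> class logits
```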

Ablation analysis indicates that next-sentence prediction is helpful, but the largest effect is using the MLM over a unidirectional model. As mentioned, the MLM does have slower convergence, but nevertheless outstrips the accuracy of unidirectional models after <100k training steps, well before convergence. Another benefit of BERT is that it can be used as an ELMo-style feature extractor for nearly as good results without fine-tuning.
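On the feature-based use: the paper’s best variant without fine-tuning concatenates the token representations from the top four encoder layers. Something like this sketch (function name is mine):

```python
import torch

def elmo_style_features(layer_outputs, n_layers=4):
    """layer_outputs: list of (batch, seq_len, hidden) tensors, one per
    encoder layer. Concatenate the last few per token as fixed features."""
    return torch.cat(layer_outputs[-n_layers:], dim=-1)  # (batch, seq, n*hidden)
```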
