A dummy’s guide to BERT

Nicole Nair · Published in The Startup · Jun 18, 2020

This blog post provides a dummy’s guide/SparkNotes version of the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Devlin, Chang, Lee and Toutanova (2019). Note: BERT is built on the Transformer architecture, so if you’d like to read up on transformers first, do read this blog post.

I find it both perplexing and delightful that there are not one, but TWO famous language representation models that are named after Sesame Street characters. Read about ELMo here. Amongst the Sesame Street characters that remain, Gonni-GAN seems the next most likely character to have a deep learning model named after him (get it?). BERT is one of the first deeply bidirectional language representation models. This article situates BERT in relation to other language representation models, then discusses its model architecture and finally its applications. It is highly recommended that you play Sesame Street in the background as you read this post; studies show that it will improve your absorption of the material.

Source: Vivinetto (2018)

BERT’s Family Tree

By family tree, I mean the family of neural language representation models. A language representation model, very simply put, is a set of vectors or weights that can represent a piece of text. Language representation models help us select features & combinations of features of a text that are particularly relevant to comprehending it (so that we can then nicely classify, complete and do various other tasks on the text). Some common tasks we tackle by finetuning & building upon language representation models are text classification, question-answering, machine translation, etc.

BERT’s closest siblings are ELMo (surprise, surprise) & GPT-2. If you’d like, you may consider word embedding models such as Word2Vec, fastText and GloVe to be the ‘parents’ of these newer language models. The strongest differentiator between the two generations of models is that the word embedding models are context-insensitive whilst their kiddos are context-sensitive. What does this mean? Well, the word embedding models cannot differentiate between two senses of the same word. For example, consider the sentence ‘I can drink a can of soda.’ Word embedding models would produce the same representation for both ‘cans’ in the sentence; BERT, ELMo and GPT-2 would give each its own. Both generations of models are often pretrained so as to be reused and finetuned for other tasks.
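To see what context-sensitivity means in practice, here is a minimal sketch (my own, not from the paper) that compares BERT’s static input-embedding table with its contextual output for the two ‘can’s in the sentence above. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint are available.

```python
# A minimal sketch: contrast a static, context-insensitive embedding with
# BERT's contextual output for the two senses of "can".
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I can drink a can of soda.", return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
can_positions = [i for i, t in enumerate(tokens) if t == "can"]  # two occurrences

with torch.no_grad():
    # Static lookup table: one vector per vocabulary item, regardless of context.
    static = model.embeddings.word_embeddings(inputs["input_ids"])[0]
    # Final-layer hidden states: one vector per token *in this sentence*.
    contextual = model(**inputs).last_hidden_state[0]

i, j = can_positions
cos = torch.nn.functional.cosine_similarity
print("static 'can' vs 'can':    ", cos(static[i], static[j], dim=0).item())       # exactly 1.0
print("contextual 'can' vs 'can':", cos(contextual[i], contextual[j], dim=0).item())  # noticeably below 1.0
```

The static vectors are identical by construction, while the contextual vectors differ because each one has been mixed with its surrounding words.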

So let’s then focus on how we differentiate BERT from its two prominent siblings, ELMo and GPT-2. The authors of the BERT paper repeatedly emphasize its bidirectionality. By bidirectional, they mean that the model jointly conditions on both left and right context in all layers (Devlin, Chang, Lee, & Toutanova, 2019). In plain English, this means that when we are predicting or representing a word/set of words, we use information that comes both before & after this word/set of words (not only before or only after). GPT-2 (like the original GPT that the BERT paper compares against) uses a unidirectional, left-to-right framework, whilst ELMo concatenates a left-to-right model with a separate right-to-left model rather than training on both directions jointly as in BERT, meaning ELMo’s bidirectionality is shallower and its representations are probably less informative.

But what is wrong with unidirectionality?

Intuitively, if our representation of a word is informed by context on both its left and right, the representation will be richer and perform better on a variety of tasks. In the ‘can of soda’ example above, it is the right-hand context (‘of soda’) that tells us the second ‘can’ is a container and not a verb.

How does BERT achieve bidirectionality?

BERT uses two objectives to train a language representation, namely 1. masked language model (MLM) and 2. next sentence prediction (NSP). Let’s look at each of these in turn:

  • Masked Language Model (MLM) A constraint of standard language models like GPT-2 (and each half of ELMo) is that they HAVE to be unidirectional. Why? The authors of the BERT paper explain that in regular conditional language models, if we use bidirectional conditioning, each word can indirectly see itself. BERT avoids this problem by doing an MLM task instead. In an MLM task, we mask/hide some proportion of the words in our input at random (15% in the paper), and we attempt to predict those masked words. Because the words we are predicting are hidden from the input, they can no longer see themselves, so we can safely train a bidirectional model! Cool, aye?
  • Next Sentence Prediction (NSP) In this task, we attempt to predict whether one sentence actually follows another (wow, what a surprise given the name). Like the paper says, we can generate a dataset for this task from any corpus by making it a binary prediction problem: for a given input sentence, we include 1. its true next sentence as a positive example and 2. randomly sampled sentences (which are not the next sentence) as negative examples. A sketch of how the training data for both objectives can be prepared follows this list.
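Here is a minimal, illustrative sketch of that data preparation. The 15% masking rate and the 80/10/10 replacement scheme come from the paper; the toy vocabulary, tokenisation and function names are my own assumptions, not the authors’ code.

```python
# Illustrative sketch of preparing data for BERT's two pre-training objectives.
import random

VOCAB = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "i", "can", "drink", "a", "of", "soda"]

def mask_tokens(tokens, mask_prob=0.15):
    """Masked Language Model: pick ~15% of tokens as prediction targets."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() > mask_prob:
            continue
        labels[i] = tok                       # the model must predict the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

def make_nsp_pair(sentences, idx):
    """Next Sentence Prediction: 50% true next sentence, 50% random sentence."""
    first = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        return first, sentences[idx + 1], True      # positive example: real next sentence
    # negative example: a random sentence (a real pipeline would exclude the true next one)
    return first, random.choice(sentences), False

# Example usage
tokens = ["[CLS]", "i", "can", "drink", "a", "can", "of", "soda", "[SEP]"]
print(mask_tokens(tokens))
corpus = [["i", "can", "drink"], ["a", "can", "of", "soda"], ["soda", "is", "sweet"]]
print(make_nsp_pair(corpus, 0))
```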

Applications (via Finetuning)

Source: Devlin, Chang, Lee and Toutanova (2019)

Question-Answering

The authors were specifically referring to the SQuAD dataset, which maps each question to a span in a corresponding passage as its answer. Think about the reading comprehension problems you may have done as an elementary school student. The task for the SQuAD dataset is essentially to underline the part of the passage that answers a given comprehension question. We use BERT to embed the question & passage pairs, and then we add a layer that predicts, for each token, how likely it is to be the start or the end of the answer span. The maximum-scoring span is used as the prediction.
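Here is a minimal PyTorch sketch of the kind of span-prediction head described above. The layer sizes and names are illustrative assumptions, not the authors’ exact fine-tuning code; a dummy tensor stands in for the BERT embeddings of a question & passage pair.

```python
# Sketch of a span-prediction head: two scores per token (start and end of the answer).
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        # One linear layer producing a start logit and an end logit per token.
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output):
        # sequence_output: [batch, seq_len, hidden] BERT embeddings of
        # "[CLS] question [SEP] passage [SEP]"
        logits = self.qa_outputs(sequence_output)          # [batch, seq_len, 2]
        start_logits, end_logits = logits.unbind(dim=-1)   # [batch, seq_len] each
        return start_logits, end_logits

# Example usage with dummy BERT output: pick the best (start, end) pair with start <= end.
seq = torch.randn(1, 16, 768)
start_logits, end_logits = SpanHead()(seq)
scores = start_logits[0].unsqueeze(1) + end_logits[0].unsqueeze(0)   # [seq_len, seq_len]
valid = torch.triu(torch.ones_like(scores)).bool()                   # keep only start <= end
scores = scores.masked_fill(~valid, float("-inf"))
start, end = divmod(int(scores.argmax()), scores.size(1))
print(f"predicted span: tokens {start}..{end}")
```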

Sentence Completion

For sentence completion, the authors used the Situations With Adversarial Generations (SWAG) dataset. They framed the problem as choosing or classifying amongst four continuations of a given input sequence. We use BERT to embed four candidate sequences, each the concatenation of the input sequence with one of the four continuations, and we conceptualize this as a classification problem by adding one more layer to the BERT network during the fine-tuning stage that scores each candidate.
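A minimal sketch of that four-way choice setup is below. Again, the names and sizes are illustrative assumptions; a dummy tensor stands in for the [CLS] embeddings that BERT would produce for each concatenated (context + continuation) sequence.

```python
# Sketch of a multiple-choice head: score the [CLS] embedding of each candidate sequence.
import torch
import torch.nn as nn

class MultipleChoiceHead(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)   # one score per candidate sequence

    def forward(self, cls_embeddings):
        # cls_embeddings: [batch, num_choices, hidden], the [CLS] vector of each
        # "[CLS] context [SEP] continuation_k [SEP]" sequence produced by BERT.
        scores = self.scorer(cls_embeddings).squeeze(-1)   # [batch, num_choices]
        return scores.softmax(dim=-1)                      # probability of each continuation

# Example usage with dummy [CLS] embeddings for four continuations:
probs = MultipleChoiceHead()(torch.randn(1, 4, 768))
print("predicted continuation:", int(probs.argmax(dim=-1)))
```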

References

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Vivinetto, G. (2018). No, Bert and Ernie aren’t gay — they’re ‘best friends,’ says ‘Sesame Street.’ Today. Retrieved from https://www.today.com/popculture/no-bert-ernie-aren-t-gay-says-sesame-street-they-t137717

Originally published at https://nicolenair.github.io/learninglibrary on June 18, 2020.
