Prerequisites & background reading to understand Google BERT

Julian Harris · Published in Speaking Naturally · Jul 21, 2019 · 5 min read

If you’re a keen Natural Language Processing practitioner with some deep learning background, Google BERT, while no longer cutting edge, is foundational in a number of ways and therefore a valuable study.

The BERT paper (11 Oct 2018) | blog post (2 Nov 2018) | source code

So many tasty layers, just like Google BERT (credit: ME! This is Venchi Gelato in Richmond, Surrey, UK. Passionfruit & Mango, Strawberry Stracchiatella and Cuore de Cocoa/Dark Chocolate. My 4yo is in the background)

Below is a simple list of the concepts in the original BERT paper that I’ve decided to call out as important prerequisites to understand before you can really appreciate BERT. I’ve grouped them into three kinds of prerequisites:

  • Essential prerequisites: things that are an essential part of BERT’s design
  • Reference only: the paper refers to them and they are useful background.
  • Foundations: you should know this stuff, but these resources are particularly good and may help provide a fresh perspective on the world.

Essential prerequisites as a practitioner

The paper and blog posts are pretty readable if you just want an overview. But if you want to work with BERT, such as doing your own fine-tuning, the concepts below are essential to understand.

My suggestion: read through the paper once, then review this list to decide which ideas are worth diving into further. I’ve ordered them roughly in terms of my own subjective knowledge needs.

Attention networks: a neural network mechanism that takes a group of words and calculates the relationships between them (“the attention”). Explainers: by Keita Kurita | by Jay Alammar | from Distill.pub
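To make “the attention” concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. The sizes and random weight matrices are purely illustrative assumptions, not BERT’s real parameters.

```python
# A minimal sketch of scaled dot-product self-attention (illustrative sizes, random weights).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token vectors; Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each word relates to every other word
    weights = softmax(scores, axis=-1)       # "the attention"
    return weights @ V                       # each output vector is a weighted mix of the others

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                 # 5 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 16)
```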

Transformers: an application of attention networks. See Jay Alammar’s Illustrated Transformer | the Annotated Transformer, which takes “Attention Is All You Need” and writes code alongside the paper to reproduce the ideas presented. Transformers are included in Google’s Tensor2Tensor library (see below).
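If it helps to see the shape of the thing, here is a hedged sketch using PyTorch’s generic built-in Transformer encoder modules. This is a structural illustration only: BERT-base stacks 12 such layers but uses GELU activations and its own trained weights.

```python
# A generic Transformer encoder stack with BERT-base-like sizes (not BERT itself).
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12)  # self-attention + feedforward block
encoder = nn.TransformerEncoder(layer, num_layers=12)      # BERT-base stacks 12 of these

tokens = torch.randn(128, 1, 768)   # (seq_len, batch, d_model): random stand-in embeddings
print(encoder(tokens).shape)        # torch.Size([128, 1, 768])
```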

Fine-tuning & transfer learning: taking a base model and customising it for a specific task with a comparatively small amount of data (hundreds of examples vs billions of examples). See the ULMFiT paper (Jan 2018), which as far as I can tell introduces the idea of fine-tuning as a kind of transfer learning that can apply to any NLP task.
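As a sketch of the fine-tuning pattern (not any particular library’s API): keep a pretrained encoder, bolt a small task head on top, and train on your labelled examples. The nn.EmbeddingBag below is just a runnable stand-in for a real pretrained encoder.

```python
# Fine-tuning pattern: pretrained encoder + small task-specific head.
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                          # reused pretrained part
        self.head = nn.Linear(hidden_size, num_labels)  # the only task-specific part

    def forward(self, input_ids):
        sentence_vector = self.encoder(input_ids)       # assumed shape (batch, hidden_size)
        return self.head(sentence_vector)

pretend_pretrained = nn.EmbeddingBag(30522, 768)        # stand-in for a real pretrained encoder
model = TextClassifier(pretend_pretrained)

# Feature-extraction variant: freeze the pretrained weights and train only the head.
# (The BERT paper instead fine-tunes everything, with a small learning rate.)
for p in model.encoder.parameters():
    p.requires_grad = False

logits = model(torch.randint(0, 30522, (4, 16)))        # 4 fake sentences of 16 token ids each
print(logits.shape)                                     # torch.Size([4, 2])
```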

Language modelling: a prediction system that predicts missing words (either the next word or, as in BERT’s case, masked words in the middle of a sentence). Jay Alammar’s Illustrated Word2vec has a really great section on language modelling.
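The task itself is simple to state. Here is a toy count-based next-word model to show what “predicting missing words” means; real language models (and BERT’s masked variant) do the same job with neural networks.

```python
# Toy bigram language model: estimate P(next word | previous word) from raw counts.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    counts = bigrams[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(predict_next("the"))   # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```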

Language model pretraining: an integral part of fine-tuning: the initial training, often unsupervised (as is the case for BERT), that builds a vector representation of language.

Contextual token representations: accepting that the same word means different things in different contexts. The first generation of word embeddings (word2vec, GloVe) were independent of context, so “queen” was represented by the same vector regardless of the fact that it means several things. See this great comparison of the methods.
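A tiny sketch of the distinction: a static embedding table can only ever return one vector per word, whereas a contextual encoder (the hypothetical contextual_model below) produces a different vector for each occurrence.

```python
# Static embeddings: one vector per word type, no matter the sentence.
static_embeddings = {"queen": [0.1, 0.7, -0.3]}   # word2vec/GloVe-style lookup table

vec_a = static_embeddings["queen"]   # "God save the queen"
vec_b = static_embeddings["queen"]   # "a queen-size bed"
assert vec_a == vec_b                # identical: context is ignored

# Contextual representations (BERT, ELMo): the whole sentence goes in, so the two
# "queen" vectors come out different. `contextual_model` is hypothetical, for illustration:
# vec_a = contextual_model("God save the queen")[3]
# vec_b = contextual_model("a queen-size bed")[1]
```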

Multi-task learning: the ability for a system to learn multiple tasks at the same time during training. BERT learns two tasks at once: masked language modelling and next sentence prediction, the latter of which is key to BERT’s better understanding of context (“long-range dependencies”). Sebastian Ruder’s multi-task learning blog post from 2017 is still one of the best. Then there’s also Andrew Ng’s Multi-Task Learning video as part of deeplearning.ai
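A minimal sketch of the BERT-style setup: one shared encoder feeding two task heads, trained with a summed loss. The nn.Embedding “encoder” and the sizes are stand-ins, not BERT’s actual Transformer.

```python
# One shared encoder, two task heads (masked LM + next sentence prediction).
import torch
import torch.nn as nn

class TwoTaskModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden)   # stand-in for the shared Transformer
        self.mlm_head = nn.Linear(hidden, vocab_size)     # masked LM: guess the hidden word
        self.nsp_head = nn.Linear(hidden, 2)              # next sentence prediction: real pair or not?

    def forward(self, input_ids):
        hidden_states = self.encoder(input_ids)           # (batch, seq_len, hidden)
        mlm_logits = self.mlm_head(hidden_states)         # a prediction for every position
        nsp_logits = self.nsp_head(hidden_states[:, 0])   # the [CLS] position summarises the pair
        return mlm_logits, nsp_logits

model = TwoTaskModel()
mlm_logits, nsp_logits = model(torch.randint(0, 30522, (2, 10)))
print(mlm_logits.shape, nsp_logits.shape)  # torch.Size([2, 10, 30522]) torch.Size([2, 2])
# In training the two losses are summed so both tasks shape the shared encoder:
# loss = mlm_loss(mlm_logits, masked_labels) + nsp_loss(nsp_logits, is_next_labels)
```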

Positional embeddings: as BERT is not a recurrent neural network (it processes all tokens in parallel), it needs another way to represent spatial relationships between words, which it does with positional embeddings. See this article on why BERT has 3 layers
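A sketch of the idea with BERT-base-like sizes: each position gets its own learned vector, which is simply added to the token embedding at that position (the token ids below are made up).

```python
# Learned positional embeddings: position ids 0..max_len-1 each get a trainable vector.
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768        # BERT-base-like sizes
token_emb = nn.Embedding(vocab_size, hidden)
position_emb = nn.Embedding(max_len, hidden)

input_ids = torch.tensor([[101, 2023, 2003, 1037, 6251, 102]])  # illustrative token ids
positions = torch.arange(input_ids.size(1)).unsqueeze(0)        # [[0, 1, 2, 3, 4, 5]]
embeddings = token_emb(input_ids) + position_emb(positions)     # word order injected here
print(embeddings.shape)   # torch.Size([1, 6, 768])
```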

Segment embeddings: BERT can handle two “sentences” to support a bunch of downstream tasks that require an input text and an output text, such as question answering, translation, paraphrasing and others. It does this with segment embeddings. Again, see this article on why BERT has three layers
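Continuing the sketch above: every token is also tagged 0 (sentence A) or 1 (sentence B), and the matching segment vector is added on top of the token and position embeddings. These three additive embeddings (token, segment, position) are presumably the “three layers” the linked article refers to.

```python
# Segment embeddings: a 0/1 id per token marks whether it belongs to sentence A or B.
import torch
import torch.nn as nn

segment_emb = nn.Embedding(2, 768)   # only two segments exist: A and B

# "[CLS] how old are you ? [SEP] i am six [SEP]"  ->  A up to the first [SEP], then B
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
segment_vectors = segment_emb(segment_ids)   # (1, 11, 768), added to token + position embeddings
print(segment_vectors.shape)
```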

WordPiece embeddings: an unsupervised way of tokenising words that balances granularity and knowledge retention against out-of-vocabulary scenarios. If you want to build custom vocabularies, note that WordPiece itself is not open sourced, but there are other similar methods (such as, confusingly, SentencePiece, which sounds like a different layer of abstraction but isn’t). See the value of WordPieces and this discussion of WordPiece vs SentencePiece
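To show the flavour of it, here is a toy greedy longest-match-first tokeniser in the WordPiece style, with “##” marking word-internal pieces. The vocabulary is made up; real WordPiece/SentencePiece vocabularies are learned from data.

```python
# Toy WordPiece-style tokenisation: greedy longest-match-first against a vocabulary.
def wordpiece_tokenize(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]   # nothing matched: fall back to the unknown token
        start = end
    return pieces

vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}
print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```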

Ablation studies: a core feature of any research: excluding a component of the solution and retesting to see what impact that component has. For instance, in BERT they removed the next sentence prediction training task to see how it affected things, and it clearly showed that NSP contributes to “long-range dependency” understanding. This paper (Jan 2019) explores ablation studies and gives a good rationale for the kinds of insights you can expect to see. One that stood out for me was:

“…in general, the more a single unit’s distribution of incoming weights changes during training, the more important this unit is for the overall classification performance.”
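In code, an ablation study is just disciplined bookkeeping: train otherwise-identical variants with one component switched off and compare them on the same dev set. The train_and_evaluate function below is a placeholder so the sketch runs; a real training loop goes inside it.

```python
# Ablation pattern: identical runs except for the component under test.
def train_and_evaluate(use_nsp=True):
    return 0.0   # placeholder: real training + evaluation on a fixed dev set goes here

configs = {
    "full model (MLM + NSP)": {"use_nsp": True},
    "ablation: no NSP":       {"use_nsp": False},   # the variant the BERT paper tests
}
for name, cfg in configs.items():
    print(f"{name}: dev score {train_and_evaluate(**cfg):.3f}")
```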

Tensor2Tensor: a general framework that makes deep learning research easier, including multi-task learning. Blog post (June 2017) | source

Random restarts: I have actually yet to find any reference to this other than a remark by Jeremy Howard about “fast.ai you get to learn about cutting edge techniques…

Reference only

These ideas are referred to in the BERT paper. You’ll almost certainly come across them too while learning about the dependencies above, but they’re not central to understanding BERT.

  • Natural language inference (NLI)
  • Paraphrasing
  • Cloze tasks
  • Closed-domain question answering
  • LSTM
  • Denoising autoencoders
  • Machine translation: probably the highest-visibility and most important NLP task, because it led the frontier of new developments: first with neural machine translation (NMT) and seq2seq, and then the first application of Transformers (TODO confirm).

Foundations

If you’re an NLP practitioner you should know this stuff; the links here are to particularly good and up-to-date resources. It is by no means complete; rather, it’s a smattering of the best resources I’ve found.

  • Objective functions
  • Distributed representation
  • Word embeddings. Jay Alammar presents NLP Word Embeddings (May 2019) and his Word2Vec blog post is second to none
  • Named entity recognition (NER): I’ve put this here though it’s pretty hard not to know NER if you’re doing NLP
  • Tokenization.
  • Fast.ai NLP Course for programmers (June 2019). Fast.ai is brilliant and this new course takes a “code first” approach. However, I find it uses lots of machine learning concepts that may get in the way of understandability, so I feel it has unarticulated prerequisites.

Other resources

In-depth comparison of BERT’s built-in feedforward network for sequence embedding classification vs conditional random fields (PDF, July 2019)

Julian Harris
Speaking Naturally

Ex-Google Technical Product guy specialising in generative AI (NLP, chatbots, audio, etc). Passionate about the climate crisis.