Understanding BERT — The basics

Dharti Dhami
Jul 4, 2020


Full credit to Chris McCormick’s blog for helping me understand and distill BERT.

What is BERT, and why has it become so popular?

BERT (Bidirectional Encoder Representations from Transformers), released in late 2018, brought deep transfer learning to NLP. BERT is a method of pre-training language representations.

The idea of pre-training models followed by task-specific fine-tuning is in itself not new: computer vision practitioners regularly use models pre-trained on large datasets like ImageNet, and in NLP we have been doing “shallow” transfer learning for years by reusing word embeddings. Models like BERT, however, enabled a major shift toward deeper knowledge transfer by transferring entire models to new tasks, essentially using large pre-trained language models as reusable language-comprehension feature extractors. You can fine-tune these models on a specific task (classification, entity recognition, question answering, etc.) with your own data to produce state-of-the-art predictions.
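To make the “reusable feature extractor” idea concrete, here is a minimal sketch of pulling contextual features out of a pre-trained BERT. It assumes PyTorch and the Hugging Face transformers library, which this post does not prescribe; it is just one common way to load BERT.

import torch
from transformers import BertModel, BertTokenizer

# Load pre-trained weights and the matching tokenizer
# (the model name "bert-base-uncased" is one of several published checkpoints).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # feature extraction only, no fine-tuning yet

inputs = tokenizer("BERT turns sentences into contextual vectors.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per input token for bert-base.
print(outputs.last_hidden_state.shape)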

Advantages of Fine-Tuning

Why use BERT rather than train a specific deep learning model (a CNN, BiLSTM, etc.) that is well suited to the specific NLP task you need?

Quicker Development

The pre-trained BERT model weights already encode a lot of information about our language. As a result, it takes much less time to train our fine-tuned model; it is as if we have already trained the bottom layers of our network extensively and only need to gently tune them while using their output as features for our task. The authors recommend only 2–4 epochs of training for fine-tuning BERT on a specific NLP task (compared to the hundreds of GPU hours needed to train the original BERT model or an LSTM from scratch!).
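For reference, the fine-tuning recipe from the BERT paper is modest: a batch size of 16 or 32, a learning rate around 2e-5 to 5e-5, and 2 to 4 epochs. Below is a rough sketch of that setup, again assuming PyTorch and Hugging Face transformers; the training loop and data loader are only indicated, not fully shown.

import torch
from transformers import BertForSequenceClassification

# BertForSequenceClassification is pre-trained BERT plus a classification layer.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Hyperparameters in the range the BERT authors recommend for fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
num_epochs = 3   # 2-4 epochs is typically enough
batch_size = 32  # the paper suggests 16 or 32

# The usual loop (train_loader is a hypothetical DataLoader of tokenized batches):
# for epoch in range(num_epochs):
#     for batch in train_loader:
#         loss = model(**batch).loss
#         loss.backward()
#         optimizer.step()
#         optimizer.zero_grad()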

Less Data

Because of the pre-trained weights, this approach lets us fine-tune on a much smaller dataset than would be required for a model built from scratch. A major drawback of NLP models built from scratch is that we often need a prohibitively large dataset to train the network to reasonable accuracy.

Better Results

Finally, this simple fine-tuning procedure (typically adding one fully-connected layer on top of BERT and training for a few epochs) has been shown to achieve state-of-the-art results with minimal task-specific adjustments for a wide variety of tasks: classification, language inference, semantic similarity, question answering, etc. Rather than implementing custom and sometimes obscure architectures shown to work well on a specific task, simply fine-tuning BERT is a better (or at least equal) alternative, as the sketch below illustrates.
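Here is a sketch of what “one fully-connected layer on top of BERT” means in code, if you write the head yourself rather than using a ready-made class. It assumes PyTorch and Hugging Face transformers; the class and variable names are illustrative, not from the original post.

import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    """Pre-trained BERT plus a single task-specific linear layer."""

    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # The entire task-specific architecture is this one layer.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # pooler_output summarizes the sentence via the [CLS] token.
        return self.classifier(outputs.pooler_output)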

Pre-BERT NLP History

The BERT model has since been outperformed on the GLUE benchmark (by RoBERTa, XLNet, and others), and new models come out every handful of months to one-up the prior state of the art. However, BERT is a landmark model, much as AlexNet was for computer vision. Understanding BERT makes it easy to follow the latest developments in pre-trained NLP models.

BERT is a departure from LSTM-based approaches to NLP, so we can focus our understanding on the Transformer architecture inside BERT without going too deep into recurrence, LSTMs, or even attention in the context of LSTMs.


If you would like to look at other resources on deep learning and BERT, these are some of my recommendations.

Andrew Ng’s deep learning course on Coursera.

Google AI Blog: Open Sourcing BERT, State-of-the-Art Pre-training for Natural Language Processing

BERT: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)

Transformer: The Illustrated Transformer

Attention: Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

In the next post, let’s dive further into the inner workings of BERT.
