Understanding BERT — The basics

Dharti Dhami
Jul 4, 2020


Full credit to Chris McCormick’s blog for helping me understand and distill BERT.

What is BERT, and why has it become so popular?

BERT (Bidirectional Encoder Representations from Transformers), released in late 2018, brought deep transfer learning to NLP. BERT is a method of pre-training language representations.

The idea of pre-training models followed by task-specific fine-tuning is in itself not new: computer vision practitioners regularly use models pre-trained on large datasets like ImageNet, and in NLP we have been doing “shallow” transfer learning for years by reusing word embeddings. Models like BERT, however, enabled a major shift toward deeper knowledge transfer by transferring entire models to new tasks, essentially using large pre-trained language models as reusable language-comprehension feature extractors. You can fine-tune these models on a specific task (classification, entity recognition, question answering, etc.) with your own data to produce state-of-the-art predictions.
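To make the “reusable feature extractor” idea concrete, here is a minimal sketch of pulling contextual features out of a pre-trained BERT. It assumes PyTorch and the Hugging Face transformers library, which this post does not prescribe; it is just one common way to load BERT.

import torch
from transformers import BertModel, BertTokenizer

# Load pre-trained weights and the matching tokenizer
# (the model name "bert-base-uncased" is one of several published checkpoints).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # feature extraction only, no fine-tuning yet

inputs = tokenizer("BERT turns sentences into contextual vectors.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per input token for bert-base.
print(outputs.last_hidden_state.shape)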

Advantages of Fine-Tuning

Why use BERT rather than train a specific deep learning model (a CNN, BiLSTM, etc.) that is well suited to the specific NLP task you need?

Quicker Development

The pre-trained BERT model weights already encode a lot of information about our language. As a result, it takes much less time to train our fine-tuned model; it is as if we have already trained the bottom layers of our network extensively and only need to gently tune them while using their output as features for our task. The authors recommend only 2–4 epochs of training for fine-tuning BERT on a specific NLP task (compared to the hundreds of GPU hours needed to train the original BERT model or an LSTM from scratch!).
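For reference, the fine-tuning recipe from the BERT paper is modest: a batch size of 16 or 32, a learning rate around 2e-5 to 5e-5, and 2 to 4 epochs. Below is a rough sketch of that setup, again assuming PyTorch and Hugging Face transformers; the training loop and data loader are only indicated, not fully shown.

import torch
from transformers import BertForSequenceClassification

# BertForSequenceClassification is pre-trained BERT plus a classification layer.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Hyperparameters in the range the BERT authors recommend for fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
num_epochs = 3   # 2-4 epochs is typically enough
batch_size = 32  # the paper suggests 16 or 32

# The usual loop (train_loader is a hypothetical DataLoader of tokenized batches):
# for epoch in range(num_epochs):
#     for batch in train_loader:
#         loss = model(**batch).loss
#         loss.backward()
#         optimizer.step()
#         optimizer.zero_grad()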

Less Data

Because of the pre-trained weights, this approach lets us fine-tune on a much smaller dataset than would be required for a model built from scratch. A major drawback of NLP models built from scratch is that we often need a prohibitively large dataset to train the network to reasonable accuracy.

Better Results

Finally, this simple fine-tuning procedure (typically adding one fully-connected layer on top of BERT and training for a few epochs) has been shown to achieve state-of-the-art results with minimal task-specific adjustments for a wide variety of tasks: classification, language inference, semantic similarity, question answering, etc. Rather than implementing custom and sometimes obscure architectures shown to work well on a specific task, simply fine-tuning BERT is a better (or at least equal) alternative, as the sketch below illustrates.
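Here is a sketch of what “one fully-connected layer on top of BERT” means in code, if you write the head yourself rather than using a ready-made class. It assumes PyTorch and Hugging Face transformers; the class and variable names are illustrative, not from the original post.

import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    """Pre-trained BERT plus a single task-specific linear layer."""

    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # The entire task-specific architecture is this one layer.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # pooler_output summarizes the sentence via the [CLS] token.
        return self.classifier(outputs.pooler_output)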

Pre-BERT NLP History

The BERT model has since been outperformed on the GLUE benchmark (by RoBERTa, XLNet, and others), and new models come out every handful of months to one-up the prior state of the art. However, BERT is a landmark model, much as AlexNet was for computer vision. Understanding BERT makes it easy to follow the latest developments in pre-trained NLP models.

BERT is a departure from LSTM-based approaches to NLP, so we can focus our understanding on the Transformer architecture inside BERT without going too deep into recurrence, LSTMs, or even attention in the context of LSTMs.


If you would like to look at other resources on deep learning and BERT, these are some of my recommendations.

Andrew Ng’s deep learning course on Coursera.

Google AI Blog: Open Sourcing BERT, State-of-the-Art Pre-training for Natural Language Processing

BERT: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)

Transformer: The Illustrated Transformer

Attention: Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

In the next post, let’s dive further into the inner workings of BERT.
