Building State-of-the-Art Language Models with BERT

Ankit Singh
Saarthi.ai
Jun 26, 2019

2018 brought a revolutionary change to the field of Natural Language Processing (NLP) with the arrival of transfer learning. Bidirectional Encoder Representations from Transformers (BERT), introduced by the Google AI team in 2018, is a classic example of transfer learning, and it shook the NLP community with its state-of-the-art results on a variety of NLP tasks.

Thanks to its pragmatic design and strong performance, BERT is used for a wide range of NLP tasks and achieves state-of-the-art results on them. This post covers BERT’s architecture in depth, along with the important background details. We’ll also take a hands-on approach, using the PyTorch library, to back up the discussion with code showing how the model achieves these results.

TL;DR

Language modelling is a remarkably effective way to pre-train a neural network: the result is a pre-trained language model that can then be adapted to other tasks.

Some things that go inside BERT:

  • Unlike conventional language models, which use only the previous tokens to predict the next token, BERT uses both the preceding and the following tokens for its predictions.
  • BERT is also trained on next sentence prediction, which makes it an appropriate choice for tasks such as question answering or sentence comparison.
  • BERT encodes sentences with the encoder of the Transformer architecture, and despite its very large number of parameters, its performance is remarkable even on small datasets.

Transfer Learning with BERT

Before deep-diving into BERT, let’s first understand the concept of transfer learning that BERT uses.

Traditional NLP models relied on word embeddings such as GloVe or word2vec, in which every word is mapped to a vector that represents some aspect of its meaning. These embeddings were trained on huge unlabelled corpora and then combined with labelled data to build task-specific models for jobs like text classification and sentiment analysis. The resulting models gain linguistic knowledge from big datasets, which is why word embeddings proved useful in almost every NLP task. But all of this came at a cost, which we will discuss as we go further into this article.

Word2vec and GloVe are trained with shallow language-modelling objectives, which makes it difficult for them to capture all of a word’s meanings, especially as the context changes. Even when fed into complex neural networks such as LSTMs, language models built on word2vec vectors failed to capture the nuances of whole sentences. This made word embeddings from word2vec or GloVe a weak foundation for language modelling.

To demonstrate this, let’s take two sentences as examples. The first is “The cottage needs a good cleaning” and the second is “He clean forgot about dropping the letters in the post box”. The word clean carries a different meaning in each: in the first example it appears as a noun (cleaning), whereas in the second it is an adverb.

Models built on such word embeddings usually don’t consider context at all. Conventional word embedding methods allocate a single vector to each word, which forces one fixed meaning onto it regardless of the sentence it appears in.

These shortcomings pushed the field towards transfer learning with deep neural networks such as LSTMs: instead of mapping each word to a fixed vector, a deep network assigns every word a vector that depends entirely on the context of the surrounding sentence.
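To see this in action, here is a minimal sketch, assuming the Hugging Face transformers library (not something the article prescribes) and two illustrative sentences containing the word clean: a static embedding would give the word the same vector in both, while BERT’s contextual vectors differ.

```python
import torch
from transformers import BertTokenizer, BertModel

# Illustrative sentences; the point is that the same word gets a different
# vector in each context, unlike with word2vec/GloVe.
sentences = ["Please clean the kitchen before you leave.",
             "He clean forgot about dropping the letters in the post box."]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    vectors.append(hidden[tokens.index("clean")])       # contextual vector for "clean"

# A static embedding would give cosine similarity 1.0; BERT's vectors differ.
print(torch.cosine_similarity(vectors[0], vectors[1], dim=0).item())
```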

The Core Idea

The basic task of a language model is to predict the word that fills a blank, or more precisely, the probability that a particular word occurs in that context. Let’s take another example:

“FC Barcelona is a _____ club”

Here the language model might predict football for the blank with 80% probability, cricket with 20% probability, and so on.
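As a quick, hands-on illustration of this idea, the sketch below feeds the same blank to a pre-trained BERT model through the Hugging Face fill-mask pipeline (an assumed tooling choice, not named in the article) and prints the top candidate words with their probabilities. As we will see shortly, [MASK] plays the role of the blank.

```python
from transformers import pipeline

# BERT fills blanks directly: [MASK] is the blank.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("FC Barcelona is a [MASK] club."):
    # Each candidate comes with the probability the model assigns to it.
    print(prediction["token_str"], round(prediction["score"], 3))
```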

Typically, a language model is trained “left to right”, and is thus framed to predict the next word. Take the example below:

“Tallest mountain is”

Here, a generic language model predicts the next word. This is a sensible approach when we want to generate new text: the model predicts the next word, appends it to the sentence, and then continues predicting and appending. But is it the only way to train a language model? More importantly, is it the most effective way to do language modelling with deep learning?
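For contrast, here is a minimal sketch of that left-to-right loop, using GPT-2 from the Hugging Face library purely as a stand-in left-to-right model (an assumption for illustration; BERT itself is not trained or used this way).

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 serves only as an example of a left-to-right language model.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

ids = tok("Tallest mountain is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):                           # greedily extend the text
        next_logits = lm(ids).logits[0, -1]      # distribution over the next token only
        next_id = next_logits.argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=1)   # append the prediction and repeat
print(tok.decode(ids[0]))
```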

BERT discards this fundamental assumption about language models, that they must be framed to predict from left to right. Its insight is that there is no need to train a language model left to right if we don’t want to generate new sentences.

This, and this alone, is the key idea behind BERT: it randomly masks words in a given context and learns to predict them.

The architecture of a typical model using transfer learning.

This approach forces the model to use relevant information from the entire sentence to recover the masked words, which is what makes the method so effective.
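The sketch below shows the same idea at a lower level than the earlier pipeline call, assuming the Hugging Face BertForMaskedLM class: one word of the earlier example sentence is replaced by [MASK], and the model has to recover it from the rest of the sentence.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Hide one word and let BERT recover it from the surrounding context.
text = f"The cottage needs a good {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                        # (1, seq_len, vocab_size)

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
best_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(best_id.tolist()))
```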

Forward, Backward and Masked LM

Before BERT, popular methods such as Embeddings from Language Models (ELMo) and ULMFiT built on LSTM language models, with ELMo in particular using the well-known Bi-LSTM. Let’s understand what a Bi-LSTM is, and what made it imperfect, before we see how BERT solved the issue. A bidirectional LSTM is trained both left-to-right, to predict the next word, and right-to-left, to predict the previous word. In other words, there are two LSTMs, one forward and one backward, but neither of them looks in both directions at the same time. In BERT, by contrast, the model learns from the words in all positions, meaning the entire sentence at once. On top of that, Google used Transformers, which made the model even more accurate. This, essentially, is what differentiates BERT from all the Bi-LSTM-based models.

The Architecture

BERT incorporates the mighty Transformer in its architecture, applying an attention mechanism over the input sentence. The Transformer consists of a stack of attention blocks; each block transforms the input sequence with linear layers and applies attention to it. In essence, it is a stack of layers that map sequences to sequences (seq2seq).
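To make that seq2seq mapping concrete, here is a minimal, single-head sketch of the scaled dot-product attention at the heart of each block; the dimensions are toy values chosen for illustration, not BERT’s real configuration.

```python
import torch
import torch.nn.functional as F

def attention_block(x, w_q, w_k, w_v):
    """One (single-head) scaled dot-product attention step: sequence in, sequence out."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # linear projections of the input
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)                    # how much each token attends to each token
    return weights @ v                                     # same sequence length out as in

# Toy sizes for illustration only (BERT-base actually uses hidden size 768 and 12 heads).
seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(attention_block(x, w_q, w_k, w_v).shape)             # torch.Size([5, 16])
```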

One point worth highlighting is that BERT uses a WordPiece tokenizer, which reduces the vocabulary size significantly. For example, running → run + ##ing.
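Here is a short sketch of this tokenizer in action, again assuming the Hugging Face BertTokenizer; the exact splits depend on the learned vocabulary, but the pattern of whole common words versus ‘##’-prefixed pieces for rarer ones is the point.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Frequent words stay whole, while rarer words are broken into '##'-prefixed pieces,
# which keeps the vocabulary down to roughly 30k entries.
for word in ["running", "saarthi", "unbelievably"]:
    print(word, "->", tokenizer.tokenize(word))
```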

Architecture of BERT

Transformers have a drawback: unlike RNNs, they do not take the order of the input into account. If, say, the first and last words of a sentence were the same, they would be treated as exactly identical tokens. BERT solves this with positional embeddings, which encode the position of each word in the sentence. Before the input tokens are fed to the network, these positional embeddings are added to the token embeddings.

Summarized input schema

For tasks such as natural language inference and question answering, BERT trains on paired sentences, which requires an embedding that tells the model which of the two sentences each token belongs to. This embedding is called the segment embedding.
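Putting the last two paragraphs together, the sketch below shows how the input is built: token, position and segment embeddings are simply summed for every position. The token ids are hypothetical placeholders; only the shapes matter here.

```python
import torch
import torch.nn as nn

# Sizes mirror BERT-base: 30,522 wordpieces, hidden size 768, max sequence length 512.
vocab_size, hidden, max_len, n_segments = 30522, 768, 512, 2

token_emb = nn.Embedding(vocab_size, hidden)      # one vector per wordpiece
position_emb = nn.Embedding(max_len, hidden)      # one vector per position in the sequence
segment_emb = nn.Embedding(n_segments, hidden)    # sentence A vs. sentence B

token_ids = torch.tensor([[101, 2023, 2003, 1037, 7099, 102]])   # placeholder ids for illustration
positions = torch.arange(token_ids.size(1)).unsqueeze(0)         # 0, 1, 2, ...
segments = torch.zeros_like(token_ids)                           # everything from "sentence A"

inputs = token_emb(token_ids) + position_emb(positions) + segment_emb(segments)
print(inputs.shape)   # torch.Size([1, 6, 768])
```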

Language Model Training

For the masked language model training of BERT, there are a few steps that have to be followed.

A highly unconventional way to train a language model is to randomly replace some percentage of the words with [MASK] tokens. This is exactly how BERT is trained: for every example, BERT masks 15% of the tokens at random.

But there is a drawback to this approach: the model only learns to make a prediction when a [MASK] token is present. In simple words, the model may become neglectful when there is no mask in the input. What we expect from the model is a correct prediction regardless of the input tokens we pass. So let’s look at how Google solved this problem.

A small fraction of the selected tokens is replaced with random words. This step has to be performed cautiously, because replacing too many original words with random ones raises the noise level and leads to poor results. Hence, of the 15% of tokens chosen for masking, BERT swaps only 10% for random words (roughly 1.5% of all tokens), leaves another 10% neither swapped nor masked, and replaces the remaining 80% with the [MASK] token.
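Here is a minimal sketch of this masking recipe in plain Python, written as a toy re-implementation for illustration rather than Google’s actual pre-processing code.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Pick ~15% of the tokens; of those, 80% become [MASK], 10% become a random
    word, and 10% are left unchanged. The model must predict the originals."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < select_prob:
            labels[i] = token                         # target the model has to recover
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = "[MASK]"
            elif roll < 0.9:
                corrupted[i] = random.choice(vocab)   # swapped for a random word
            # else: token is kept as it is (but still has to be predicted)
    return corrupted, labels

tokens = "the cottage needs a good cleaning".split()
toy_vocab = ["house", "blue", "run", "letter", "box"]
print(mask_tokens(tokens, toy_vocab))
```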

Predicting the next sentence using BERT

To excel at tasks such as question answering and natural language inference, BERT is also trained on next sentence prediction.

Next input prediction example

When two sentences are taken as input, BERT inserts a [SEP] token to separate them. During pre-training, 50% of the time the second sentence really does follow the first, and 50% of the time it is an entirely random sentence. The model’s job is to predict whether the second sentence was actually the next sentence.
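The sketch below builds such a sentence pair, assuming the Hugging Face BertForNextSentencePrediction class and two illustrative sentences; per that library’s convention, index 0 of the output scores “the second sentence really follows the first” and index 1 scores “it is a random sentence”.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

sentence_a = "He went to the post office."                 # illustrative pair
sentence_b = "He dropped the letters in the post box."

# Passing two texts makes the tokenizer join them as [CLS] A [SEP] B [SEP]
# and build the matching segment ids automatically.
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                        # shape (1, 2)

# Index 0 scores "B really follows A", index 1 scores "B is a random sentence".
print(torch.softmax(logits, dim=-1))
```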

Fine Tuning the Language Model

The BERT encoder outputs a sequence of hidden states. For classification tasks we need only a single vector to make the prediction, so this sequence has to be reduced to one vector. There are two ways I would mention here: max or mean pooling, and attention. However, the easiest approach is simply to take the hidden state corresponding to the first token.

Working of BERT

The question is, how does this pooling mechanism work?

BERT has one more special token, the classification token, written as [CLS]. The model expects this [CLS] token at the beginning of every input, and its corresponding hidden state is the single vector used for classification.
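Tying this together, here is a minimal fine-tuning-style sketch: the [CLS] hidden state from a pre-trained BertModel is fed to a small linear head (a hypothetical two-class classifier added for illustration); during fine-tuning this head, and usually BERT itself, would be trained on labelled data.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)     # hypothetical two-class head

inputs = tokenizer("BERT was a major breakthrough for NLP.", return_tensors="pt")
with torch.no_grad():
    hidden_states = bert(**inputs).last_hidden_state   # (1, seq_len, 768)

cls_vector = hidden_states[:, 0]     # hidden state of the leading [CLS] token
logits = classifier(cls_vector)      # this head (and usually BERT) is trained during fine-tuning
print(logits.shape)                  # torch.Size([1, 2])
```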

Takeaways

1. The size of the model matters. BERT_large, with 345 million parameters, is the largest model of its kind and shows superb performance on small datasets compared to BERT_base, which has 110 million parameters.

2. More training steps mean greater accuracy (provided there is enough training data). For example, on the MNLI task, BERT’s accuracy improves by 1% when it is trained for 1 million steps rather than 500k steps with the same batch size.

3. BERT’s masked LM converges a bit more slowly than left-to-right LM training, because only 15% of the words are predicted in each batch. Even so, masked LM training overtakes left-to-right training after a modest number of pre-training steps.

Source: BERT [Devlin et al., 2018]

Conclusion

So, we saw how BERT was a major breakthrough in the field of NLP, and how it achieved state-of-the-art results on a range of language modelling tasks.

Its masked LM approach is what sets it apart from other transfer learning approaches such as ELMo and ULMFiT, and it surpasses their performance.

In this summary I have tried to explain the core idea behind BERT; to dig deeper, you should read the excellent BERT paper itself. For the source code, you can check out the git repository.

For articles on language modelling, conversational agents and more, follow the dialogue
