Demystifying BERT: The Groundbreaking NLP Framework

Mohd Sanad Zaki Rizvi

Published in

Analytics Vidhya

13 min readSep 25, 2019

Introduction to the World of BERT

It’s not an exaggeration to say that Google AI’s BERT has significantly altered the NLP landscape.

Imagine using a single model that is trained on a large unlabelled dataset to achieve State-of-the-Art results on 11 individual NLP tasks. And all of this with little fine-tuning. That’s BERT! It’s a tectonic shift in how we design NLP models.

BERT has inspired many recent NLP architectures, training approaches and language models, such as Google’s TransformerXL, OpenAI’s GPT-2, XLNet, ERNIE2.0, RoBERTa, etc.

Note: In this article, we are going to talk a lot about Transformers. If you aren’t familiar with it, feel free to read this article first — How do Transformers Work in NLP? A Guide to the Latest State-of-the-Art Models.

What is BERT?

You’ve heard about BERT, you’ve read about how incredible it is, and how it’s potentially changing the NLP landscape. But what is BERT in the first place?

Here’s how the research team behind BERT describes the NLP framework:

BERT stands for Bidirectional Encoder Representations from Transformers.
It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context.
As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.

That sounds way too complex as a starting point. But it does summarize what BERT does pretty well so let’s break it down.

First, it’s easy to get that BERT stands for Bidirectional Encoder Representations from Transformers. Each word here has a meaning to it and we will encounter that one by one in this article. For now, the key takeaway from this line is — BERT is based on the Transformer architecture.

Second, BERT is pre-trained on a large corpus of unlabelled text including the entire Wikipedia(that’s 2,500 million words!) and Book Corpus (800 million words).

This pre-training step is half the magic behind BERT’s success. This is because as we train a model on a large text corpus, our model starts to pick up the deeper and intimate understandings of how the language works.
This knowledge is the swiss army knife that is useful for almost any NLP task.

Third, BERT is a “deeply bidirectional” model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase.

The bidirectionality of a model is important for truly understanding the meaning of a language. Let’s see an example to illustrate this. There are two sentences in this example and both of them involve the word “bank”:

*BERT captures both the left and right context*

If we try to predict the nature of the word “bank” by only taking either the left or the right context, then we will be making an error in at least one of the two given examples.

One way to deal with this is to consider both the left and the right context before making a prediction. That’s exactly what BERT does!

And finally, the most impressive aspect of BERT. We can fine-tune it by adding just a couple of additional output layers to create state-of-the-art models for a variety of NLP tasks.

From Word2Vec to BERT: NLP’s Quest for Learning Language Representations

“One of the biggest challenges in natural language processing is the shortage of training data.
Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labelled training examples.” — Google AI

Word2Vec and GloVe

The quest for learning language representations by pre-training models on large unlabelled text data started from word embeddings like Word2Vec and GloVe.

These embeddings changed the way we performed NLP tasks. We now had embeddings that could capture contextual relationships among words.

These embeddings were used to train models on downstream NLP tasks and make better predictions.

One limitation of these embeddings was the use of very shallow Language Models. This meant there was a limit to the amount of information they could capture.

Another key limitation was that these models did not take the context of the word into account. Let’s take the above “bank” example. The same word has different meanings in different contexts, right?

However, an embedding like Word2Vec will give the same vector for “bank” in both the contexts.

That’s valuable information we are losing.

Enter ELMO and ULMFiT

ELMo was the NLP community’s response to the problem of Polysemy — same words having different meanings based on their context.

From training shallow feed-forward networks (Word2vec), we graduated to training word embeddings using layers of complex Bi-directional LSTM architectures.

This meant that the same word can have multiple ELMO embeddings based on the context it is in.

That’s when we started seeing the advantage of pre-training as a training mechanism for NLP.

ULMFiT took this a step further. This framework could train language models that could be fine-tuned to provide excellent results even with fewer data (less than 100 examples) on a variety of document classification tasks. It is safe to say that ULMFiT cracked the code to transfer learning in NLP.

This is when we established the golden formula for transfer learning in NLP:

Transfer Learning in NLP = Pre-Training and Fine-Tuning

Most of the NLP breakthroughs that followed ULMFIT tweaked components of the above equation and gained state-of-the-art benchmarks.

OpenAI’s GPT

OpenAI’s GPT extended the methods of pre-training and fine-tuning that were introduced by ULMFiT and ELMo.

GPT essentially replaced the LSTM-based architecture for Language Modeling with a Transformer-based architecture.

The GPT model could be fine-tuned to multiple NLP tasks beyond document classification, such as common sense reasoning, semantic similarity, and reading comprehension.

OpenAI’s GPT validated the robustness and usefulness of the Transformer architecture by achieving multiple State-of-the-Arts.

And this is how Transformer inspired BERT and all the following breakthroughs in NLP.

Moving onto BERT

So, the new approach to solving NLP tasks became a 2-step process:

Train a language model on a large unlabelled text corpus (unsupervised or semi-supervised)
Fine-tune this large model to specific NLP tasks to utilize the large repository of knowledge this model has gained (supervised)

With that context, let’s understand how BERT takes over from here to build a model that will become a benchmark of excellence in NLP for a long time.

How Does BERT Work? A Look Under the Hood

Let’s look a bit closely at BERT and understand why it is such an effective method to model language.

1. BERT’s Architecture

The BERT architecture builds on top of Transformer. We currently have two variants available:

BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
BERT Large: 24 layers (transformer blocks), 16 attention heads and, 340 million parameters

The BERT Base architecture has the same model size as OpenAI’s GPT for comparison purposes. All of these Transformer layers are Encoder-only blocks.

If your understanding of the underlying architecture of the Transformer is hazy, I will recommend that you read about it here.

Now that we know the overall architecture of BERT, let’s see what kind of text processing steps are required before we get to the model building phase.

2. Text Preprocessing

The developers behind BERT have added a specific set of rules to represent the input text for the model. Many of these are creative design choices that make the model even better.

For starters, every input embedding is a combination of 3 embeddings:

Position Embeddings: BERT learns and uses positional embeddings to express the position of words in a sentence. These are added to overcome the limitation of Transformer which, unlike an RNN, is not able to capture “sequence” or “order” information
Segment Embeddings: BERT can also take sentence pairs as inputs for tasks (Question-Answering). That’s why it learns a unique embedding for the first and the second sentences to help the model distinguish between them. In the above example, all the tokens marked as EA belong to sentence A (and similarly for EB)
Token Embeddings: These are the embeddings learned for the specific token from the WordPiece token vocabulary

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.

Such a comprehensive embedding scheme contains a lot of useful information for the model.

These combinations of preprocessing steps make BERT so versatile.

3. Pre-training Tasks

BERT is pre-trained on two NLP tasks:

Masked Language Modeling
Next Sentence Prediction

Let’s understand both of these tasks in a little more detail!

a. Masked Language Modeling (Bi-directionality)

BERT is designed as a deeply bidirectional model. The network effectively captures information from both the right and left context of a token from the first layer itself and all the way through to the last layer.

Traditionally, we had language models either trained to predict the next word in a sentence (right-to-left context used in GPT) or language models that were trained on a left-to-right context. This made our models susceptible to errors due to loss in information.

ELMo tried to deal with this problem by training two LSTM language models on left-to-right and right-to-left contexts and shallowly concatenating them. Even though it greatly improved upon existing techniques, it wasn’t enough.

That’s where BERT greatly improves upon both GPT and ELMo. Look at the below image:

The arrows indicate the information flow from one layer to the next. The green boxes at the top indicate the final contextualized representation of each input word.

It’s evident from the above image: BERT is bi-directional, GPT is unidirectional (information flows only from left-to-right), and ELMO is shallowly bidirectional.

About Masked Language Models

This is where the Masked Language Model comes into the picture.

Let’s say we have a sentence — “I love to read data science blogs on Analytics Vidhya”. We want to train a bi-directional language model.

Let’s replace “Analytics” with “[MASK]”. This is a token to denote that the token is missing.

We’ll then train the model in such a way that it should be able to predict “Analytics” as the missing token: “I love to read data science blogs on [MASK] Vidhya.”

This is the crux of a Masked Language Model. The authors of BERT also include some caveats to further improve this technique:

To prevent the model from focusing too much on a particular position or tokens that are masked, the researchers randomly masked 15% of the words
The masked words were not always replaced by the masked tokens [MASK] because the [MASK] token would never appear during fine-tuning
So, the researchers used the below technique:
80% of the time the words were replaced with the masked token [MASK]
10% of the time the words were replaced with random words
10% of the time the words were left unchanged

I have shown how to implement a Masked Language Model in Python in one of my previous articles here:

Introduction to PyTorch-Transformers: An Incredible Library for State-of-the-Art NLP (with Python code)

b. Next Sentence Prediction

Masked Language Models (MLMs) learn to understand the relationship between words.

Additionally, BERT is also trained on the task of Next Sentence Prediction for tasks that require an understanding of the relationship between sentences.

A good example of such a task would be question answering systems.

The task is simple. Given two sentences — A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence?

Since it is a binary classification task, the data can be easily generated from any corpus by splitting it into sentence pairs. Just like MLMs, the authors have added some caveats here too. Let’s take this with an example:

Consider that we have a text dataset of 100,000 sentences. So, there will be 50,000 training examples or pairs of sentences as the training data.

For 50% of the pairs, the second sentence would actually be the next sentence to the first sentence
For the remaining 50% of the pairs, the second sentence would be a random sentence from the corpus
The labels for the first case would be ‘IsNext’ and ‘NotNext’ for the second case

And this is how BERT is able to become a true task-agnostic model.

It combines both the Masked Language Model (MLM) and the Next Sentence Prediction (NSP) pre-training tasks.

Implementing BERT for Text Classification in Python

Your mind must be whirling with the possibilities BERT has opened up. There are many ways we can take advantage of BERT’s large repository of knowledge for our NLP applications.

One of the most potent ways would be fine-tuning it on your own task and task-specific data. We can then use the embeddings from BERT as embeddings for our text documents.

In this section, we will learn how to use BERT’s embeddings for our NLP task. We’ll take up the concept of fine-tuning an entire BERT model in one of the future articles.

For extracting embeddings from BERT, we will use a really useful open source project called Bert-as-Service:

Running BERT can be a painstaking process since it requires a lot of code and installing multiple packages. That’s why this open-source project is so helpful because it lets us use BERT to extract encodings for each sentence in just two lines of code.

Installing BERT-As-Service

BERT-As-Service works in a simple way. It creates a BERT server which we can access using the Python code in our notebook. Every time we send it a sentence as a list, it will send the embeddings for all the sentences.

We can install the server and client via pip. They can be installed separately or even on different machines:

$ pip install bert-serving-server # server 
$ pip install bert-serving-client # client

Also, since running BERT is a GPU intensive task, I’d suggest installing the bert-serving-server on a cloud-based GPU or some other machine that has high compute capacity.

Note that the server MUST be running on Python >= 3.5 with TensorFlow >= 1.10 ( one-point-ten).

Now, go back to your terminal and download a model listed below. Then, uncompress the zip file into some folder.

Here’s a list of the released pre-trained BERT models:

We’ll download BERT Uncased and then decompress the zip file:

$ wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip && unzip uncased_L-12_H-768_A-12.zip

Once we have all the files extracted in a folder, it’s time to start the BERT service:

$ bert-serving-start -model_dir uncased_L-12_H-768_A-12/ -num_worker=2 -max_seq_len 50

You can now simply call the BERT-As-Service from your Python code (using the client library). Let’s just jump into code!

Open a new Jupyter notebook and try to fetch embeddings for the sentence: “I love data science and analytics vidhya”.

Here, the IP address is the IP of your server or cloud. This field is not required if used on the same computer.

The shape of the returned embedding would be (1,768) as there is only a single sentence which is represented by 768 hidden units in BERT’s architecture.

Problem Statement: Classifying Hate Speech on Twitter

Let’s take up a real-world dataset and see how effective BERT is. We’ll be working with a dataset consisting of a collection of tweets that are classified as being “hate speech” or not.

You can download the dataset and read more about the problem statement on the DataHack platform.

For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.

We will use BERT to extract embeddings from each tweet in the dataset and then use these embeddings to train a text classification model.

Here is how the overall structure of the project looks like:

Let’s look at the code now:

You’ll be familiar with how most people tweet. There are many random symbols and numbers (aka chat language!). Our dataset is no different. We need to pre-process it before passing it through BERT:

Now that the dataset is clean, it’s time to split it into training and validation set:

Let’s get the embeddings for all the tweets in the training and validation sets:

It’s model building time! Let’s train the classification model:

Check the classification accuracy:

Even with such a small dataset, we easily get a classification accuracy of around 95%. That’s damn impressive.

I encourage you to go ahead and try BERT’s embeddings on different problems and share your results in the comments below.

In the next article, I plan to take a BERT model and fine-tune it fully on a new dataset and compare its performance.

Beyond BERT: Current State-of-the-Art in NLP

BERT has inspired great interest in the field of NLP, especially the application of the Transformer for NLP tasks. This has led to a spurt in the number of research labs and organizations that started experimenting with different aspects of pre-training, transformers and fine-tuning.

Many of these projects outperformed BERT on multiple NLP tasks. Some of the most interesting developments were RoBERTa, which was Facebook AI’s improvement over BERT and DistilBERT, which is a compact and faster version of BERT.

You can read more about these amazing developments regarding State-of-the-Art NLP in this article.

Originally published at https://www.analyticsvidhya.com on September 25, 2019.