#30DaysOfNLP

NLP-Day 25: NLP’s Best Friend. BERT

Introducing Bidirectional Encoder Representations from Transformers

Marvin Lanhenke
5 min read · May 1, 2022


NLP’s Best Friend BERT #30DaysOfNLP [Image by Author]

In the previous episode, we concluded our deep dive into the subject of Transformers by implementing a positional embedding layer with the Keras API.

However, we’re not quite done yet and will stay a little longer in the world of transformer-based models.

In the following sections, we’re going to meet a new friend. BERT. We will not only learn why BERT is useful but also how it makes use of an attention mechanism to learn a language model.

So sharpen your pencils, prepare your best friend’s book, and make sure to follow #30DaysOfNLP: NLP’s Best Friend. BERT

My name is BERT

Bidirectional Encoder Representations from Transformers to be specific.

BERT is a language representation model that is designed to pre-train deep bidirectional representations from unlabeled text. And the keyword here is bidirectional.

BERT is able to account for both context directions, left and right, resulting in a pre-trained model that can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide variety of tasks, such as question answering and language inference.

Both existing strategies for applying pre-trained language representations to downstream tasks, feature-based and fine-tuning, have proven effective at improving NLP-related tasks.

However, they suffer from one main deficiency.

A unidirectional approach limits the choice of architectures. In a standard left-to-right language model, for example, every token can only attend to the tokens that precede it, so only one direction of context is captured. This is suboptimal for sentence-level tasks or for question answering, where it is crucial to incorporate context from both directions.

BERT overcomes this limitation by applying two learning strategies.

The first is a masked language model (MLM): a portion of the input tokens is randomly masked, and the learning objective is to predict the masked words based only on their surrounding context. This allows the model to combine the left and right context directions. The second strategy involves predicting whether one sentence follows another.

The building blocks

The underlying framework consists of two steps: Pre-training and fine-tuning.

During pre-training, the model is trained on unlabeled data over different pre-training tasks. In the fine-tuning phase, the model is initialized with the pre-trained parameters and fine-tuned by making use of labeled data. This approach allows for a unified architecture across a variety of problems.

BERT framework [Image by Author based on Devlin et al. (2018)]

The model’s architecture is defined by a multi-layer bidirectional Transformer based on the vanilla Transformer implementation as described by Vaswani et al. (2017) in the paper Attention Is All You Need. Since BERT’s goal is to generate a language model, only the encoder is needed.
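
For a sense of scale, the paper’s BERT-Base configuration uses 12 encoder layers, a hidden size of 768, and 12 attention heads (roughly 110M parameters). Below is a minimal sketch of instantiating such an encoder-only model, assuming the Hugging Face transformers library, a convenience that postdates the paper itself:

```python
from transformers import BertConfig, BertModel

# BERT-Base: 12 encoder layers, hidden size 768, 12 attention heads
config = BertConfig(num_hidden_layers=12, hidden_size=768, num_attention_heads=12)

# An encoder-only Transformer stack; no decoder is involved
model = BertModel(config)
print(model.num_parameters())  # roughly 110 million
```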

Unlike the traditional left-to-right or right-to-left language models, BERT is pre-trained by using two unsupervised tasks.

Masked LM

In order to properly train a bidirectional representation, without illegal connections that allow each word to indirectly “see itself”, 15% of the input tokens are randomly replaced with a [MASK] token.

This procedure defines the “masked language model” and is often referred to as a Cloze task.

Predicting the masked values [Image by Author]

By adding a classification layer on top of the encoder output, the model attempts to predict the original value of each masked word by computing a softmax probability distribution over the entire vocabulary. In contrast to denoising auto-encoders, BERT only tries to predict the [MASK] tokens and ignores the non-masked words.

Pre-training BERT with this approach allows for increased context awareness. However, it comes at the cost of slower convergence, since only 15% of the tokens in each batch contribute to the prediction loss.
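
To see this prediction step in action, here is a small sketch that queries the pre-trained bert-base-uncased checkpoint through the Hugging Face fill-mask pipeline (tooling that, of course, postdates the original paper):

```python
from transformers import pipeline

# Load a pre-trained BERT together with its masked-LM head
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT computes a probability distribution over the vocabulary for the
# [MASK] position; the pipeline returns the most likely candidates
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```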

Note: The masking process is slightly more sophisticated. 15% of the token positions are chosen at random. Of those, 80% are replaced with [MASK], 10% are replaced by a random token, and the last 10% remain unchanged. This procedure minimizes the mismatch between pre-training and fine-tuning, since [MASK] tokens never appear in fine-tuning data.
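
As an illustration of that selection rule, here is a small NumPy sketch; the function name, the vocabulary id, and the -100 “ignore” label are my own choices for this example, not BERT’s original preprocessing code:

```python
import numpy as np

MASK_ID = 103  # id of the [MASK] token in the bert-base-uncased vocabulary

def mask_tokens(token_ids, vocab_size, rng=None):
    """Apply the 15% / 80-10-10 masking rule to a sequence of token ids."""
    rng = rng or np.random.default_rng()
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -100)  # -100: position ignored by the loss

    # 1) choose 15% of the positions as prediction targets
    selected = rng.random(token_ids.shape) < 0.15
    labels[selected] = token_ids[selected]

    # 2) of those: 80% become [MASK], 10% a random token, 10% stay unchanged
    roll = rng.random(token_ids.shape)
    token_ids[selected & (roll < 0.8)] = MASK_ID
    random_ids = rng.integers(0, vocab_size, token_ids.shape)
    replace = selected & (roll >= 0.8) & (roll < 0.9)
    token_ids[replace] = random_ids[replace]

    return token_ids, labels
```

Only the positions recorded in labels contribute to the masked-LM loss; all other positions are ignored.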

Next Sentence Prediction

Tasks like Question Answering and Natural Language Inference rely on the understanding of the relationships between sentences. Those relationships, however, are not directly captured by language modeling.

BERT overcomes this limitation by receiving sentence pairs as input and learning to predict whether the second sentence actually follows the first.

During the training stage, 50% of the inputs remain a proper pair, whereas in the other 50% a random sentence from the overall corpus is chosen as the second sentence. The underlying assumption here is that a randomly chosen sentence will be semantically disconnected from the first sentence.

In order for the model to be able to distinguish between both sentences, the input is preprocessed before being presented to the model.

Input preprocessing [Image from Devlin et al. (2018)]

A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is appended at the end of each sentence. In addition to that, a segment embedding is created, indicating whether a token belongs to sentence A or sentence B. We also create a positional embedding, injecting the positional information of each token.
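
A short sketch of what this preprocessing looks like with the Hugging Face BertTokenizer; the example sentences are made up, and the positional embedding is added inside the model rather than by the tokenizer:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair; [CLS] and [SEP] are inserted automatically
encoding = tokenizer("the man went to the store", "he bought some milk")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'the', 'man', 'went', 'to', 'the', 'store', '[SEP]',
#  'he', 'bought', 'some', 'milk', '[SEP]']

print(encoding["token_type_ids"])  # 0 = sentence A, 1 = sentence B
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```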

Now, the entire input sequence flows through the Transformer model, and the probability that the second sentence follows the first (a binary classification) is computed by applying a softmax function to the output of the [CLS] token.

Both strategies, Masked LM and Next Sentence Prediction, are trained simultaneously, with the aim of minimizing the combined loss function.

Fine-tuning BERT

Since BERT can be used for a variety of natural language tasks by only adding an additional output layer, fine-tuning is relatively straightforward and inexpensive to perform.

For each task, we simply plug in the specific inputs and outputs and fine-tune all the parameters in an end-to-end fashion.

Classification tasks, for example, can be solved similarly to Next Sentence Prediction by simply adding a classification layer on top of the Transformer output.
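
To illustrate how little extra machinery this requires, here is a rough sketch of a single fine-tuning step for binary sentence classification, using the Hugging Face transformers and PyTorch stack (one possible tooling choice, with made-up example data):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained encoder plus a freshly initialized classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A tiny made-up batch, just to show the end-to-end flow
batch = tokenizer(["a great movie", "a complete waste of time"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # the loss is computed internally
outputs.loss.backward()
optimizer.step()
```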

Conclusion

In this article, we introduced a new friend. BERT.

A breakthrough in the field of Natural Language Processing that enables fast and approachable fine-tuning for a wide variety of tasks. We covered the underlying concepts as well as the different unsupervised prediction tasks applied to generate a language model.

However, we did this all in theory. In the next article, it’s time to tinker around and implement our own BERT model.

So take a seat, don’t go anywhere, make sure to follow, and never miss a single day of the ongoing series #30DaysOfNLP.

