What Is BERT? How Is It Trained? A High-Level Overview

Suraj Yadav
8 min read · Jul 13, 2023

https://cdn-images-1.medium.com/max/1500/1*g1KBCVCITjrd9IJ7AyFqdw.png

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a groundbreaking model in the field of natural language processing (NLP) and deep learning. It was introduced by researchers at Google in 2018 and has since become one of the most influential and widely used models in NLP.

BERT is a type of transformer-based neural network architecture that learns contextualized word representations by leveraging the bidirectional nature of language. Unlike previous models that only consider the surrounding words in a unidirectional manner, BERT can capture the context from both the left and right sides of a given word. This bidirectional approach allows BERT to better understand the nuances and dependencies within a sentence or a paragraph.

The core idea behind BERT is pre-training and fine-tuning. In the pre-training phase, BERT is trained on a massive amount of unlabeled text data, such as books, articles, and web pages. During this phase, the model learns to predict missing words in a sentence by considering the surrounding words. This process enables BERT to acquire a deep understanding of the language’s syntactic and semantic structures.

After pre-training, BERT is fine-tuned on specific downstream tasks, such as text classification, named entity recognition, question answering, and sentiment analysis. In the fine-tuning phase, BERT is trained on labeled data for the target task, allowing it to adapt its learned representations to the specific requirements of that task. By transferring the knowledge gained during pre-training to these downstream tasks, BERT has shown remarkable performance and achieved state-of-the-art results in various NLP benchmarks.

The architecture of BERT is based on the Transformer model. The Transformer model uses self-attention mechanisms to capture relationships between words in a sentence without relying on sequential processing, making it highly parallelizable and efficient.
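To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the operation at the heart of each encoder layer. The shapes, random values, and single attention head are purely illustrative; real BERT uses multi-head attention with learned projection matrices.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors."""
    Q = X @ W_q                      # queries
    K = X @ W_k                      # keys
    V = X @ W_v                      # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # every token scores every other token
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V               # contextualized token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 tokens, embedding dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # -> (5, 8)
```

Because every token attends to every other token in one matrix operation, there is no left-to-right sequential dependency, which is what makes the computation so parallelizable.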

https://media.geeksforgeeks.org/wp-content/uploads/20230522175845/elmo-eemmbeddings-(1).jpg

BERT, like the Transformer model, consists of a stack of encoder layers. However, unlike unidirectional language models that read text only from left to right, BERT is bidirectional, allowing it to leverage information from both the left and right context of a word. This bidirectional nature is a key aspect that sets BERT apart and enables it to capture richer semantic representations.

Training of BERT

Training BERT involves two main stages: pre-training and fine-tuning. In the pre-training stage, BERT is trained on a large corpus of unlabeled text data, while in the fine-tuning stage, it is adapted to specific downstream tasks using labeled data. Let’s explore each stage.

Pre-training:

The pre-training stage of BERT involves training the model on a vast amount of unlabeled text data to learn general language representations. The primary objective of pre-training is to enable BERT to capture the contextual relationships between words and sentences.

  1. Tokenization: The input text is tokenized into subword units using a process called WordPiece tokenization. This breaks words down into smaller, meaningful subword units, such as “un” and “##happy” for the word “unhappy,” which helps handle out-of-vocabulary words.
  2. Masked Language Model (MLM):

One of the key techniques used in pre-training BERT is the Masked Language Model (MLM). In this approach, a certain percentage of the input tokens are randomly selected (usually around 15% of the tokens) and masked (replaced with a [MASK] token). The objective is for BERT to predict the original masked tokens based on the context provided by the surrounding tokens.

https://editor.analyticsvidhya.com/uploads/13216BERT_MLM.png
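As a rough illustration of how such training inputs might be prepared, the sketch below randomly selects about 15% of the tokens as prediction targets. In the original BERT recipe, about 80% of the selected tokens are replaced with [MASK], 10% with a random token, and 10% are left unchanged; the function name and data here are invented for illustration only.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Illustrative MLM masking: returns the corrupted tokens and the targets."""
    corrupted = list(tokens)
    targets = [None] * len(tokens)          # None = position is not predicted
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = token              # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"                 # ~80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)     # ~10%: replace with a random token
            # remaining ~10%: leave the token unchanged
    return corrupted, targets

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
print(mask_tokens(tokens, vocab=tokens))
```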

The model receives the masked sentence as input and learns to recover the original tokens by minimizing the prediction error at the masked positions. Because each masked word is predicted from the representations of the surrounding, non-masked tokens, the model can draw on both the left and the right context; the multi-layer bidirectional Transformer encoder is what makes this possible.

Suppose we have the following sentence: “The quick brown fox jumps over the lazy dog.”

The sentence is tokenized into subword units using WordPiece tokenization. After tokenization, we have the following tokens: [“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “.”].
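If the Hugging Face transformers library is available, the tokenization step can be reproduced along these lines; note that the bert-base-uncased checkpoint lowercases its input, so the exact pieces may differ slightly from the tokens listed above.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
# Common words map to single WordPiece tokens; rarer words are split into
# pieces prefixed with "##", as in the "un" / "##happy" example above.
```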

  • Let’s say we choose a masking probability of 15%. We randomly select some tokens to be masked; in this example, we’ll mask three tokens: “fox,” the second “the,” and “dog.”
  • After masking, the input sentence becomes: “The quick brown [MASK] jumps over [MASK] lazy [MASK] .”

BERT takes the masked sentence as input and learns to predict the original masked words. The model receives the following input: [“The”, “quick”, “brown”, “[MASK]”, “jumps”, “over”, “[MASK]”, “lazy”, “[MASK]”, “.”]

During training, BERT uses its bidirectional Transformer architecture to predict the original masked words.

  • For instance, given the input, BERT predicts the original masked words as follows: [“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “.”].

The model learns by minimizing the prediction error between the predicted masked words and the original masked words.
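A quick way to see this prediction step in action, assuming the transformers library and the bert-base-uncased checkpoint are available, is the fill-mask pipeline:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The quick brown [MASK] jumps over the lazy dog."):
    print(prediction["token_str"], round(prediction["score"], 3))
# The top candidates should include plausible fillers such as "fox";
# the exact ranking and scores depend on the checkpoint.
```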

  3. Next Sentence Prediction (NSP):

Another important task during pre-training is Next Sentence Prediction (NSP). Here, BERT is trained to predict whether a given pair of sentences appears consecutively in the original text or whether the second sentence was randomly sampled. This teaches BERT relationships between sentences rather than only between words, deepening its understanding of language.

The training process involves creating training examples by pairing sentences from a large corpus of text. For each training example, two sentences are chosen: a “context” sentence and a “next” sentence. The context sentence is selected from the corpus, and the next sentence can be either the sentence that immediately follows the context sentence in the original text or a randomly sampled sentence from the corpus.

To train BERT with NSP, the input to the model is a concatenation of the two sentences with a special separator token ([SEP]) in between them. Additionally, a special classification token ([CLS]) is inserted at the beginning of the input. The entire input sequence is then passed through multiple layers of transformer encoders.

https://media.geeksforgeeks.org/wp-content/uploads/20210701233612/BERT2sentence-660x551.JPG

During training, BERT learns to predict whether the next sentence follows the context sentence by performing a binary classification. This is achieved by adding a binary classification layer on top of the BERT model. The classification layer takes the final hidden state representation of the [CLS] token and outputs a probability score indicating whether the next sentence is indeed the correct continuation of the context sentence or not.

Suppose we have a training corpus consisting of various pairs of sentences. Let’s consider the following two sentences as an example:

Sentence 1: “The cat is sitting on the mat.”

Sentence 2: “It is a sunny day outside.”

To train BERT with NSP, we need to create a training example by pairing these two sentences. We also need to determine whether Sentence 2 follows Sentence 1 or if it is a randomly sampled sentence.

First, we assign Sentence 1 the role of the “context” sentence: “The cat is sitting on the mat.” Next, we pick Sentence 2 as the candidate “next” sentence; in this example it is a randomly sampled sentence (“It is a sunny day outside.”) rather than the true continuation.

To construct the input for BERT, we concatenate the two sentences with a special separator token ([SEP]) in between, like this:

[CLS] The cat is sitting on the mat. [SEP] It is a sunny day outside. [SEP]

The [CLS] token is a special token inserted at the beginning of the input sequence to represent the entire input for classification tasks.
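With the transformers library, this packing of a sentence pair is exactly what the tokenizer produces when given two sentences at once. This is a hedged sketch; the output shown in the comments is the expected format, not a guaranteed transcript.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The cat is sitting on the mat.",
                    "It is a sunny day outside.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'cat', ..., '[SEP]', 'it', 'is', ..., '[SEP]']
print(encoded["token_type_ids"])
# 0s for the tokens of the first sentence, 1s for the second
```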

The input sequence is then passed through multiple layers of transformer encoders in the BERT model, allowing it to capture the relationships between the tokens and learn contextual representations.

The final hidden state representation of the [CLS] token is used for the NSP task. A binary classification layer is added on top of the BERT model to predict whether Sentence 2 is the correct continuation of Sentence 1.

During training, the model is provided with a large number of such sentence pairs, some of which are consecutive in the original text while others are randomly sampled. For each pair, the model learns to predict whether the second sentence follows the context sentence or not. In our example, the training label records that Sentence 2 (“It is a sunny day outside.”) was randomly sampled and does not follow Sentence 1 (“The cat is sitting on the mat.”), so the model’s objective is to correctly predict that these sentences are not consecutive.
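The sketch below, which assumes transformers and PyTorch are installed, uses BertForNextSentencePrediction, which adds exactly this binary classifier on top of the [CLS] representation; in this head, index 0 corresponds to “the second sentence follows the first” and index 1 to “the second sentence is random.”

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat is sitting on the mat.",
                   "It is a sunny day outside.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape (1, 2)

probs = torch.softmax(logits, dim=-1)
print(probs)   # [P(is the next sentence), P(is a random sentence)]
```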

BERT optimizes two training objectives simultaneously: MLM and NSP. These objectives are combined and optimized using a large corpus of unlabeled text data, such as books, articles, and web pages. The training process involves iteratively adjusting the model’s parameters to minimize the loss function associated with the predicted masked tokens and next sentence predictions.
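Here is a hedged sketch of how both losses can be computed in one forward pass, using BertForPreTraining from the transformers library. The masked word and labels are hand-built for illustration; a real pre-training pipeline would generate them automatically from raw text.

```python
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat is sitting on the [MASK].",
                   "It is a sunny day outside.",
                   return_tensors="pt")

# MLM labels: -100 everywhere (ignored), original token id at the masked position.
labels = torch.full_like(inputs["input_ids"], -100)
masked = inputs["input_ids"] == tokenizer.mask_token_id
labels[masked] = tokenizer.convert_tokens_to_ids("mat")

# NSP label: 1 means the second sentence was randomly sampled.
next_sentence_label = torch.tensor([1])

outputs = model(**inputs, labels=labels, next_sentence_label=next_sentence_label)
print(outputs.loss)   # sum of the MLM and NSP losses
```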

Fine-tuning:

After pre-training, BERT is fine-tuned on specific downstream tasks using labeled data. Fine-tuning involves adapting the pre-trained BERT model to the target task, enabling it to make predictions and solve specific NLP problems.

For fine-tuning, task-specific labeled data is required. This data is typically smaller in size compared to the pre-training data but contains annotations or labels specific to the target task, such as sentiment labels, question-answer pairs, or named entity annotations.

In the fine-tuning stage, BERT’s architecture is typically modified to suit the specific task at hand. This modification may involve adding task-specific layers or adjusting the model’s input and output structures to match the requirements of the target task.

During fine-tuning, BERT’s parameters are adjusted using the task-specific labeled data. The model is trained to minimize a task-specific loss function, which varies with the target task. For example, in text classification the loss is typically categorical cross-entropy over the class labels, while in named entity recognition it is usually a token-level classification loss computed over the entity tags.
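As an illustration, the following sketch fine-tunes BERT for binary sentiment classification with the transformers library. The two toy examples, the label convention, the learning rate, and the number of epochs are assumptions chosen for demonstration only.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["I loved this movie!", "This was a waste of time."]   # toy labeled data
labels = torch.tensor([1, 0])                                  # 1 = positive, 0 = negative

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):                           # a few illustrative epochs
    outputs = model(**batch, labels=labels)      # cross-entropy computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")
```

Because the encoder starts from the pre-trained weights, only the small classification head is trained from scratch, which is why relatively little labeled data is needed.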

One of the key advantages of BERT is its ability to transfer knowledge gained during pre-training to the target task. By initializing the fine-tuning process with the pre-trained weights, BERT can leverage the general language representations it learned during pre-training. This transfer learning approach significantly reduces the amount of labeled data required for training and helps achieve better performance on various NLP tasks.

Overall, BERT is trained through a two-stage process: pre-training and fine-tuning. Pre-training involves training on a large corpus of unlabeled text data using MLM and NSP objectives, enabling BERT to learn contextualized word representations. Fine-tuning then adapts the pre-trained BERT model to specific downstream tasks using task-specific labeled data, optimizing task-specific loss functions. This combination of pre-training and fine-tuning allows BERT to excel in understanding and solving a wide range of NLP problems.
