Understanding the BERT Model

Avinash
6 min read · Feb 10, 2024


BERT has set new benchmarks in the NLP domain by delivering unparalleled performance across numerous tasks. This article kicks off with an introduction to BERT, highlighting its uniqueness and how it stands out from previous embedding techniques. Next, we’ll dive into how BERT functions and explore its configurations thoroughly.

Following that, we’ll delve into BERT’s pre-training phase, focusing on two specific tasks: masked language modeling and next sentence prediction, explaining each in detail. We’ll also examine the steps involved in BERT’s pre-training process.

To conclude, we will investigate various subword tokenization methods that BERT utilizes, such as byte pair encoding, byte-level byte pair encoding, and WordPiece, shedding light on their significance and application.

  • Understanding the Basics of BERT
  • How BERT Operates
  • BERT Configurations
  • Pre-training the BERT Model
  • The Pre-training Process
  • Subword Tokenization Techniques

Understanding the Basics of BERT

BERT, short for Bidirectional Encoder Representations from Transformers, is a groundbreaking natural language processing (NLP) model introduced by Google. It has significantly advanced NLP, achieving impressive results on a variety of tasks, including question answering, named entity recognition, and sentence classification, among others. A key factor behind BERT’s success is its ability to understand the context of words in a sentence, setting it apart from well-known context-free embedding models such as word2vec.

BERT is built on the transformer architecture but uses only its encoder component; in essence, BERT is a stack of transformer encoders.

The transformer’s encoder reads a sentence bidirectionally, attending to each word’s context from both the left and the right. This bidirectional encoding of the input, inherited from the transformer architecture, is what gives BERT its name.

When a sentence is input into the encoder, it utilizes the multi-head attention mechanism to grasp the context surrounding every word. Consequently, the encoder outputs a contextualized representation for each word in the sentence.

The size of each token’s representation matches the hidden size of the encoder layer. For example, if the hidden size is 768, then every token’s representation will also be a 768-dimensional vector. The hidden size depends on the specific configuration of the model: different BERT variants use encoder layers of different sizes to balance computational efficiency against the ability to capture complex patterns in the data.
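To see this concretely, here is a minimal sketch (assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint) that inspects the contextual representation BERT produces for each token:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("We are watching a movie", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)  # expected: torch.Size([1, 7, 768])

Each of the 7 tokens (including [CLS] and [SEP]) receives its own 768-dimensional contextual vector.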

BERT Configurations

The creators of BERT introduced the model in two primary configurations:

  • BERT-Base
  • BERT-Large

BERT-Base:

BERT-Base is built from 12 encoder layers, each placed sequentially above the previous one, and each layer uses 12 attention heads. The hidden size of every encoder layer is 768 (its feed-forward sub-layer uses 3,072 intermediate units). Consequently, the representation size produced by BERT-Base is 768.

BERT-Large:

BERT-Large consists of 24 encoder layers, each stacked one on top of the other, and every encoder uses 16 attention heads. The hidden size of each encoder layer is 1,024 (with a feed-forward sub-layer of 4,096 intermediate units). Thus, the size of the representation obtained from BERT-Large is 1,024.
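As a rough sketch of these two configurations (again assuming the Hugging Face transformers library; the intermediate_size values are the standard feed-forward sizes), they can be written out explicitly:

from transformers import BertConfig

# Standard BERT-Base hyperparameters
base_config = BertConfig(
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
    intermediate_size=3072,  # size of the feed-forward (intermediate) layer
)

# Standard BERT-Large hyperparameters
large_config = BertConfig(
    num_hidden_layers=24,
    num_attention_heads=16,
    hidden_size=1024,
    intermediate_size=4096,
)

print(base_config.hidden_size, large_config.hidden_size)  # 768 1024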

Data Representations:

In the BERT model, the input is represented in the following ways:

Token embedding

For token embeddings, we first tokenize the sentence into words, as shown below:

x = “We are watching a movie”

Tokens = [We, are, watching, a, movie]

We then add a [CLS] token at the beginning of the sentence and a [SEP] token at the end:

Tokens = [[CLS], We, are, watching, a, movie, [SEP]]

Segment embedding:

When the input contains more than one sentence (a sentence pair), we add a [SEP] token at the end of each sentence, and the segment embedding tells the model whether a given token belongs to sentence A or sentence B.

Position embedding:

Since BERT operates using the encoder from the transformer architecture, it requires knowledge of the word positions within a sentence before processing. To provide this positional information, we use a layer known as the positional embedding layer, which assigns a positional embedding to each token in the sentence.
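Putting the three pieces together, the input representation BERT feeds to its first encoder layer is the element-wise sum of the token, segment, and position embeddings. The sketch below uses PyTorch with illustrative sizes and hypothetical token ids:

import torch
import torch.nn as nn

vocab_size, hidden_size, max_len = 30522, 768, 512  # illustrative sizes

token_emb = nn.Embedding(vocab_size, hidden_size)     # one vector per token id
segment_emb = nn.Embedding(2, hidden_size)            # sentence A (0) vs sentence B (1)
position_emb = nn.Embedding(max_len, hidden_size)     # one vector per position

# Hypothetical token ids for "[CLS] we are watching a movie [SEP]"
input_ids = torch.tensor([[101, 2057, 2024, 3666, 1037, 3185, 102]])
segment_ids = torch.zeros_like(input_ids)                        # everything belongs to sentence A
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)      # 0, 1, 2, ...

# The representation fed to the first encoder layer is the sum of all three
embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(embeddings.shape)  # torch.Size([1, 7, 768])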

WordPiece tokenizer

BERT utilizes the WordPiece tokenizer, which adopts a subword tokenization approach. To illustrate how the WordPiece tokenizer functions, consider the example sentence:

“I enjoy learning about natural language processing.”

Tokenizing this sentence with the WordPiece tokenizer might result in the following set of tokens:

tokens = [i, enjoy, learn, ##ing, about, natural, language, process, ##ing]

Here, the tokenizer divides certain words into smaller units, using ‘##’ to denote the segmented parts of a word, which helps BERT understand both the word and its root meaning.
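As a quick, hedged illustration (assuming the Hugging Face transformers library and the bert-base-uncased vocabulary; the exact splits depend on the vocabulary the tokenizer was trained with, so common words often stay whole):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Words that are not in the vocabulary are split into subwords marked with "##"
print(tokenizer.tokenize("I enjoy learning about natural language processing."))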

Pre-training the BERT Model

The BERT model undergoes pre-training through two specific tasks:

  • Masked Language Modeling (MLM)
  • Next Sentence Prediction (NSP)

Masked Language Modeling (MLM):

BERT functions as an auto-encoding language model, which allows it to consider context from both the left and the right sides of a given word to enhance its predictions. In the masked language modeling task, BERT takes an input sentence and randomly obscures 15% of the words. The objective for the model is to accurately guess these masked words by analyzing the context provided by the surrounding unmasked words in the sentence, utilizing information from both directions.

To reduce the mismatch between pre-training (where the [MASK] token appears) and fine-tuning (where it never does), the selected tokens are handled with the 80–10–10 rule:

  • 80% of the time, the chosen token is replaced with the [MASK] token
  • 10% of the time, it is replaced with a random token
  • 10% of the time, it is left unchanged

If a word has been split into subword tokens, the whole word masking variant masks all of the subword tokens belonging to that word together.
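Here is a minimal, plain-Python sketch of the 80–10–10 masking rule; the token list and the tiny vocabulary are illustrative, and real implementations work on token ids and whole batches:

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply the 80-10-10 masking rule to roughly 15% of the tokens."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or random.random() > mask_prob:
            continue                         # special tokens are never masked
        labels[i] = token                    # the model must recover the original token
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"             # 80%: replace with the [MASK] token
        elif r < 0.9:
            masked[i] = random.choice(vocab) # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels

tokens = ["[CLS]", "we", "are", "watching", "a", "movie", "[SEP]"]
print(mask_tokens(tokens, vocab=["cat", "sunny", "mat", "day"]))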

Next Sentence Prediction (NSP):

Next Sentence Prediction (NSP) is another training strategy employed to enhance the capabilities of the BERT model. NSP is essentially a binary classification task where BERT is given a pair of sentences and must determine if the second sentence logically follows the first. This task helps BERT learn the relationships between consecutive sentences.

After the representation layer, the output for the [CLS] token is passed through a feed-forward layer followed by a softmax function to predict whether the second sentence follows the first (isNext vs. notNext), as sketched below.
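A rough PyTorch sketch of that classification head, assuming a 768-dimensional [CLS] representation from BERT-Base:

import torch
import torch.nn as nn

hidden_size = 768  # assumed BERT-Base hidden size

# Binary classifier on top of the [CLS] representation: isNext vs. notNext
nsp_head = nn.Linear(hidden_size, 2)

cls_representation = torch.randn(1, hidden_size)  # stand-in for BERT's [CLS] output
probs = torch.softmax(nsp_head(cls_representation), dim=-1)
print(probs)  # e.g. tensor([[0.48, 0.52]]) -> P(isNext), P(notNext)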

The Pre-training Process:

The pre-training process for BERT involves leveraging datasets such as the Toronto BookCorpus and Wikipedia. BERT’s pre-training comprises two key tasks: masked language modeling (also known as the cloze task) and Next Sentence Prediction (NSP). Here’s how we prepare the dataset for these tasks:

Initially, we select two segments of text (sentences) from the dataset. For our example, let’s consider two sentences, A and B. The combined token count of sentences A and B should not exceed 512 tokens. During the selection process, half the time sentence B is chosen as a direct continuation of sentence A, and the other half, it’s chosen as unrelated.

For instance, let’s select:

Sentence A: “The cat sat on the mat”

Sentence B: “It was a sunny day”

We start by tokenizing these sentences using the WordPiece tokenizer. We prepend the [CLS] token at the start and append the [SEP] token at the end of each sentence, resulting in the following tokens:

tokens = [[CLS], the, cat, sat, on, the, mat, [SEP], it, was, a, sunny, day, [SEP]]

Next, we mask 15% of the tokens based on the 80–10–10% rule for masking. If we choose to mask “sunny”, the tokens update to:

tokens = [[CLS], the, cat, sat, on, the, mat, [SEP], it, was, a, [MASK], day, [SEP]]

These tokens are then fed into BERT for training. The model is trained to predict the masked token (“sunny”) and determine whether Sentence B logically follows Sentence A. This way, BERT is simultaneously trained on both masked language modeling and NSP tasks.
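As a hedged sketch of running such a pair through a pre-training-style model (assuming the Hugging Face transformers BertForPreTraining class, which exposes both the MLM head and the NSP head):

import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Encoding a sentence pair adds [CLS]/[SEP] and the segment ids automatically
inputs = tokenizer("The cat sat on the mat", "It was a sunny day", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# prediction_logits: vocabulary scores for every position (masked language modeling head)
# seq_relationship_logits: isNext / notNext scores for the pair (NSP head)
print(outputs.prediction_logits.shape)         # (1, sequence_length, vocab_size)
print(outputs.seq_relationship_logits.shape)   # (1, 2)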

The model is optimized with the Adam optimizer (learning rate 1e-4, β1 = 0.9, β2 = 0.999, weight decay of 0.01), with learning-rate warmup over the first 10,000 steps followed by linear decay.
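A short sketch of that optimizer setup, using PyTorch's AdamW together with the transformers linear warmup schedule; the step counts follow the original paper and are assumptions for your own runs:

import torch
from transformers import BertForPreTraining, get_linear_schedule_with_warmup

model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Adam with weight decay, mirroring the original pre-training setup
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)

# Warmup over the first 10,000 steps, then linear decay
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=1_000_000
)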

We’re going to dive into some intriguing subword tokenization methods that are instrumental in forming the vocabulary for text processing. Once we’ve established this vocabulary, it becomes the foundation for tokenizing text data. Here are three subword tokenization algorithms that are frequently employed:

  • Byte Pair Encoding (BPE)
  • Byte-Level Byte Pair Encoding
  • WordPiece

These algorithms are key to dissecting text into subword units, enabling more efficient and flexible handling of language data in various models.
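To give a flavour of how these algorithms work, here is a minimal byte pair encoding sketch in plain Python on a toy corpus: it repeatedly counts adjacent symbol pairs and merges the most frequent one, and the merged symbols become the subword vocabulary.

from collections import Counter

# Toy word-frequency corpus; "</w>" marks the end of a word
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

def merge_step(corpus):
    """Count adjacent symbol pairs and merge the most frequent pair everywhere."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    best = max(pairs, key=pairs.get)
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged, best

for _ in range(3):  # learn three merge rules
    corpus, best = merge_step(corpus)
    print("merged pair:", best)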

Please refer to further reading on these algorithms for more detail.

