Understanding BERT: The Power Behind Modern NLP

How BERT Understands Language Context

Archit Saxena
Analytics Vidhya
7 min read · Sep 16, 2024


Introduction

In our previous discussion on attention mechanisms, we explored how these techniques enable models to focus on different parts of an input sequence, enhancing their contextual understanding. Building on these concepts, BERT (Bidirectional Encoder Representations from Transformers), introduced by Google, represents a major leap forward in NLP. By leveraging bidirectional context and the Transformer architecture, BERT offers a deeper understanding of language, setting new benchmarks for various NLP tasks.

We will be using the figures from the paper and Jay Alammar’s blog.

The BERT Model

Image by author

BERT builds on the attention mechanisms discussed earlier by employing a Transformer-based architecture, specifically utilizing the encoder part of the Transformer. Here’s how BERT enhances language understanding:

Bidirectional Context: Unlike traditional models that process text in a unidirectional manner (left-to-right or right-to-left), BERT reads text in both directions simultaneously. This bidirectional approach allows BERT to capture context from both sides of a word, providing a richer understanding of its meaning. For instance, in the sentence “The bank closed early because of the flood,” BERT can accurately interpret “bank” as a financial institution rather than a riverbank due to its ability to consider surrounding words in both directions.
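
To make this tangible, here is a small sketch (not from the original article) that compares BERT’s contextual embedding of the word “bank” across sentences. It assumes the Hugging Face transformers and torch packages and the publicly released bert-base-uncased checkpoint; because BERT conditions on both left and right context, the same surface word receives different vectors in different sentences.

# Illustrative sketch: comparing BERT's contextual embeddings of "bank".
# Assumes the Hugging Face `transformers` and `torch` packages and the
# public "bert-base-uncased" checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the final-layer hidden state for the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v1 = bank_vector("The bank closed early because of the flood.")
v2 = bank_vector("We sat on the river bank and watched the flood.")
v3 = bank_vector("The bank approved my loan application.")

cos = torch.nn.functional.cosine_similarity
print(cos(v1, v3, dim=0))  # financial sense vs. financial sense: higher similarity
print(cos(v1, v2, dim=0))  # financial sense vs. river-bank sense: lower similarity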

Transformer Architecture: Central to BERT’s capabilities is the self-attention mechanism within the Transformer model. This mechanism allows BERT to weigh the importance of each word in a sentence relative to the others. By computing attention scores, BERT can effectively focus on relevant words, enhancing its contextual comprehension and handling complex language patterns.
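
To ground the idea, here is a minimal NumPy sketch of the scaled dot-product self-attention computation that each encoder layer performs. The dimensions and weights are toy, illustrative values, not BERT’s actual parameters.

# Minimal sketch of scaled dot-product self-attention (toy values, not BERT's
# real weights). Each token attends to every other token, and the output is a
# context-weighted mixture of value vectors.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # context-aware token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                      # 4 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (4, 8)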

BERT Architecture

The paper presents two model sizes for BERT:

BERT BASE and BERT LARGE Encoder stack; Source: blog
  • BERT BASE (L=12, H=768, A=12, Total Parameters=110M)
  • BERT LARGE (L=24, H=1024, A=16, Total Parameters=340M)

where L = the number of Transformer layers (encoder blocks), H = the hidden size, and A = the number of self-attention heads (the feed-forward/filter size is 4H).

For comparison, the default configuration of the Transformer in the “Attention Is All You Need” paper is L=6, H=512, A=8.
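
As a rough sanity check on these sizes, the sketch below builds both configurations with the Hugging Face transformers library (an assumption; the original BERT release is TensorFlow code) and counts parameters. The totals land near the reported 110M and 340M.

# Sketch: instantiating the BERT BASE and BERT LARGE configurations with the
# Hugging Face `transformers` library and counting their parameters.
# Counts are approximate and include the embedding layers and pooler.
from transformers import BertConfig, BertModel

base  = BertConfig(num_hidden_layers=12, hidden_size=768,  num_attention_heads=12)
large = BertConfig(num_hidden_layers=24, hidden_size=1024, num_attention_heads=16,
                   intermediate_size=4096)   # feed-forward size = 4H

for name, cfg in [("BERT BASE", base), ("BERT LARGE", large)]:
    model = BertModel(cfg)   # randomly initialised; no pretrained weights downloaded
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: L={cfg.num_hidden_layers}, H={cfg.hidden_size}, "
          f"A={cfg.num_attention_heads}, ~{n_params / 1e6:.0f}M parameters")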

Model Inputs

To allow BERT to handle a variety of downstream tasks, its input representation can unambiguously represent both a single sentence and a pair of sentences (e.g., <Question, Answer>) in one token sequence.

A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

BERT BASE Encoder stack; Source: blog
  • WordPiece embeddings are used with a 30,000-token vocabulary.
  • Each sequence begins with a special classification token [CLS].
  • The final hidden state of the [CLS] token is used as the sequence’s representation for classification tasks.
  • Sentence pairs are combined into a single sequence.
  • Two methods are used to distinguish between the sentences:
    - A special separator token ([SEP]) is inserted between them.
    - A learned embedding is applied to each token to indicate whether it belongs to sentence A or sentence B.
  • The input embedding is represented as E.
BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings; Source: paper

Like the vanilla Transformer encoder, BERT takes a sequence of tokens as input and passes it up through its stack of layers. Each layer applies self-attention, runs the result through a feed-forward network, and hands the output to the next encoder layer.
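
To see this input format in practice, the short sketch below (assuming the Hugging Face transformers library and its bert-base-uncased WordPiece tokenizer) packs a sentence pair into one sequence; the example sentences are the ones used in the paper’s input figure.

# Sketch: how a sentence pair is packed into a single BERT input sequence.
# Assumes the Hugging Face `transformers` library and the "bert-base-uncased"
# WordPiece tokenizer.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("my dog is cute",       # sentence A
                    "he likes playing")     # sentence B

# The tokenizer inserts [CLS] at the start and [SEP] after each sentence.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# token_type_ids carries the learned segment signal: 0 for sentence A, 1 for sentence B.
print(encoded["token_type_ids"])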

Pre-training and Fine-tuning

BERT’s effectiveness is largely due to its innovative approach to pre-training and fine-tuning:

Pre-training:

1. Masked Language Modeling (MLM):

During pre-training, BERT masks a random 15% of the tokens in a sentence and learns to predict these masked tokens from the context provided by the surrounding words. This task encourages the model to develop a deeper understanding of language context and relationships.

BERT masks 15% of words in the input and asks the model to predict the missing word; Source: blog

The training data generator chooses 15% of the token positions at random for prediction. When a token is chosen, one of three things happens: it is replaced with the [MASK] token 80% of the time, swapped with a random word 10% of the time, or left unchanged 10% of the time (keeping the original word biases the representation towards the actual observed word).

# For example, given the unlabeled sentence:
my dog is hairy

# 80% of the time: Replace the word with the [MASK] token:
my dog is hairy → my dog is [MASK]

# 10% of the time: Replace the word with a random word:
my dog is hairy → my dog is apple

# 10% of the time: Keep the word unchanged:
my dog is hairy → my dog is hairy
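
The 80/10/10 rule can be written out in a few lines. The function below is a simplified, illustrative sketch of a training-data generator applying it to a token list; the real BERT preprocessing operates on WordPiece tokens over large corpora.

# Simplified sketch of the MLM masking strategy (illustrative, not the original
# BERT preprocessing code): 15% of positions are selected; of those, 80% become
# [MASK], 10% become a random vocabulary word, and 10% stay unchanged.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:           # choose ~15% of positions
            labels[i] = tok                       # the model must predict the original
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random word
            # else: 10% keep the original token unchanged
    return masked, labels

random.seed(0)
tokens = "my dog is hairy".split()
vocab = ["apple", "river", "quick", "dog", "house"]   # toy vocabulary
print(mask_tokens(tokens, vocab))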

2. Next Sentence Prediction (NSP):

BERT also learns to predict whether one sentence follows another in a text. This helps the model grasp the relationship between sentences, improving its performance on tasks that require an understanding of sentence pairs.

The second task BERT is pre-trained on is a two-sentence classification task; Source: blog

When choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).
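
A toy sketch of how such IsNext/NotNext pairs could be assembled from a corpus of documents is shown below; it is a simplification of the paper’s procedure, echoing the paper’s own NSP example sentences.

# Toy sketch of building Next Sentence Prediction pairs (a simplification of
# the paper's procedure): 50% of the time B really follows A ("IsNext"),
# 50% of the time B is a random sentence from the corpus ("NotNext").
import random

def make_nsp_pair(document, corpus):
    i = random.randrange(len(document) - 1)
    sentence_a = document[i]
    if random.random() < 0.5:
        return sentence_a, document[i + 1], "IsNext"
    # Simplification: the random sentence may occasionally come from the same document.
    random_doc = random.choice(corpus)
    return sentence_a, random.choice(random_doc), "NotNext"

corpus = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
random.seed(1)
print(make_nsp_pair(corpus[0], corpus))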

Fine-tuning:

After pre-training, BERT is fine-tuned on specific tasks like sentiment analysis or question answering. This phase adapts the general language understanding developed during pre-training to the particular requirements of the task at hand, enabling BERT to excel in a wide range of NLP applications.

The BERT paper shows several ways to use BERT for different tasks.

Illustrations of Fine-tuning BERT on Different Tasks; Source: paper
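
As a concrete illustration of the fine-tuning setup, the sketch below (assuming the Hugging Face transformers and torch libraries rather than the paper’s original TensorFlow code) places a classification head on top of the pooled [CLS] representation and runs a single toy training step. Real fine-tuning would use a labelled dataset, a learning-rate schedule, and a few epochs.

# Sketch of fine-tuning BERT for sentence classification with the Hugging Face
# `transformers` library: a classification head sits on top of the pooled [CLS]
# representation and the whole model is trained end-to-end. One toy step only.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["I loved this movie!", "This was a waste of time."]
labels = torch.tensor([1, 0])                        # toy sentiment labels
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)              # forward pass returns the loss
outputs.loss.backward()                              # backpropagate through all layers
optimizer.step()
print(float(outputs.loss))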

Feature-based Approach with BERT

Fine-tuning isn’t the only way to use BERT. Similar to ELMo, we can use pre-trained BERT to generate contextualized word embeddings and then feed these embeddings into an existing model. The paper shows that this approach produces results quite close to fine-tuning BERT, particularly for tasks like named entity recognition (NER).

Which vector works best as a contextualized embedding likely depends on the specific task. The paper explores six options (compared to the fine-tuned model, which scored a Dev F1 of 96.4).

NER results. Hyperparameters were selected using the Dev set. The reported Dev scores are averaged over 5 random restarts using those hyperparameters; Source: blog
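
A minimal sketch of this feature-based route (again assuming the Hugging Face transformers library): ask BERT for all of its hidden states and, for instance, concatenate the last four encoder layers per token, one of the options the paper compares. The resulting vectors would then feed an existing task model, such as a BiLSTM for NER.

# Sketch of the feature-based approach: extract contextual embeddings from a
# frozen BERT and feed them to a separate downstream model. Here we concatenate
# the last four hidden layers per token.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Jim Henson was a puppeteer", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple: embeddings + 12 layers

# Concatenate the last four encoder layers for every token: (1, seq_len, 4*768).
features = torch.cat(hidden_states[-4:], dim=-1)
print(features.shape)
# These fixed features could now be fed to an existing model (e.g., a BiLSTM for NER).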

Impact on NLP — Why BERT is a Milestone in NLP Evolution

BERT has significantly impacted the field of NLP by setting new performance standards and demonstrating the power of bidirectional context:

  • Enhanced Performance: BERT has achieved state-of-the-art results on multiple NLP benchmarks, including the Stanford Question Answering Dataset (SQuAD) and the General Language Understanding Evaluation (GLUE) benchmark. Its ability to understand context more deeply has led to notable improvements in accuracy and reliability.
Test results on different benchmarks
  • Advancing Transfer Learning: BERT’s approach to pre-training and fine-tuning has demonstrated the effectiveness of transfer learning in NLP. Models pre-trained on large datasets can be fine-tuned on specific tasks with relatively small amounts of data, making them highly versatile and efficient.
  • Inspiration for Variants: BERT’s success has inspired the development of various derivatives, such as RoBERTa, ALBERT, and DistilBERT. These models build on BERT’s principles, offering optimizations for different tasks and computational constraints.

Real-World Applications of BERT

  • Search Engines (Google Search): According to an October 2019 announcement, Google began using BERT to improve Search by better understanding complex queries. For example, BERT helps with queries like “Can you get medicine for someone at a pharmacy?” by grasping contextual details. Google’s use of BERT in Search may have evolved since then.
  • Question Answering Systems: BERT enhances virtual assistants and chatbots by providing accurate answers to user questions through improved context understanding.
  • Sentiment Analysis: BERT helps analyze sentiment in reviews and feedback, aiding businesses in grasping customer opinions and emotions (a short sketch follows this list).
  • Text Classification: BERT is used for categorizing text in legal documents, spam filtering, and news classification.
  • Named Entity Recognition (NER): BERT identifies entities like names and locations, benefiting sectors such as finance and healthcare.
  • Language Translation and Summarization: BERT-inspired models are used for summarizing documents and translating text based on context.
  • Healthcare: BERT aids in processing medical texts, classifying documents, and identifying drug interactions.
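
Taking the sentiment-analysis item above as a concrete example, a BERT-family model can be applied in a few lines through the Hugging Face pipeline API. The sketch below is illustrative and uses the library’s default sentiment checkpoint (a DistilBERT model fine-tuned on SST-2); any BERT model fine-tuned for sentiment could be substituted.

# Sketch: sentiment analysis with a BERT-family model through the Hugging Face
# `pipeline` API. The default checkpoint is a DistilBERT model fine-tuned on
# SST-2; swap in any fine-tuned BERT sentiment model as needed.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment(["The product arrived on time and works perfectly.",
                 "Customer support never answered my emails."]))
# e.g. [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]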

Conclusion

BERT’s integration of bidirectional context and Transformer-based self-attention mechanisms represents a significant advancement in NLP. Its versatility, as demonstrated by its wide range of applications in search engines, virtual assistants, healthcare, and more, has set a new standard for how we approach language understanding tasks. As we continue to explore the capabilities of NLP models, BERT remains a key milestone in the evolution of language understanding technologies.

If you like it, please leave a 👏.

Feedback/suggestions are always welcome.
