Exploring BERT: Feature extraction & Fine-tuning

Mouna Labiadh
Published in DataNess.AI · 9 min read · Feb 13, 2024

An introduction on BERT, one of the first Transformer-based large language models, and examples of how it can be used in common NLP applications.

This post is part of an NLP blog series co-written with Asma Zgolli.

BERT (Bidirectional Encoder Representations from Transformers) [1] is a large language model developed by Google AI in 2018. It is:

  • Pretrained: on two unsupervised tasks, masked language modeling (i.e. predicting randomly masked words in the input sentence) and next sentence prediction, using BooksCorpus and English Wikipedia (16GB of text).
  • Bidirectional: it deeply fuses the left and right context into the learned representation of each token.

The architecture of BERT is composed of multiple encoder layers that apply self-attention to their input and pass it to the following layer. For instance, the smallest BERT model, BERT BASE, is composed of 12 encoder layers (cf. Figure 1), 768 hidden units in its feed-forward neural network block, and 12 attention heads.

Figure 1: BERT BASE architecture (Figure adapted from [9])

Input representations

BERT takes as input sequences composed of single sentences or sentence pairs (e.g. <question, answer> for question-answering tasks). Input sequences are prepared before being fed to the model using the WordPiece tokenizer, which has a 30k-token vocabulary. It works by splitting words into subword units (tokens). Using subwords instead of whole words keeps the vocabulary small (only 30k entries) and significantly reduces the number of potential out-of-vocabulary (OOV) tokens.
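As a quick, hedged illustration (the exact split depends on the vocabulary; the 'bert-base-uncased' checkpoint is assumed here), a word absent from the vocabulary is decomposed into known word pieces, with '##' marking continuation subwords:

from transformers import AutoTokenizer

# load the WordPiece tokenizer used by BERT (uncased variant assumed here)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# a word absent from the 30k vocabulary is split into known subwords
print(tokenizer.tokenize("embeddings"))
# ['em', '##bed', '##ding', '##s'] with this vocabulary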

Special tokens are:

  • [CLS] is used as the first token of every sequence. It was introduced in the original BERT implementation so that its corresponding final hidden state can serve as the single vector representing the entire input sequence fed to a classifier.
  • [SEP] is used to separate the two sentences of a pair in the input sequence (e.g. question answering) and to mark the end of a sequence.
  • [PAD] represents padding (empty tokens) in the input sentences. The model expects fixed-length sequences as input, so a maximum length is chosen depending on the dataset: shorter sentences are padded, whereas longer sentences are truncated. To explicitly differentiate between real tokens and [PAD] tokens, an attention mask is used.
Figure 2: Example of a sentence encoding using BERT (Figure by authors)

Each token is then encoded by its ID, which corresponds to its index in the vocabulary.

Eventually, for each token, the input embedding is the sum of the token, segment, and positional embeddings. The token embedding is obtained from a lookup table at the embedding layer (as illustrated in Figure 2), whose rows correspond to all possible token IDs in the vocabulary (30k rows in this case) and whose columns span the embedding dimensions.

The segment embedding indicates whether a given token belongs to the first or the second sentence. The positional embedding indicates the position of a token in the sentence. In contrast to the original Transformer [2], which uses fixed trigonometric (sinusoidal) functions, BERT learns positional embeddings from the absolute ordinal position.
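To make this concrete, here is a minimal sketch that reproduces the input embedding of a single token as the sum of the three lookup tables. It relies on the internal attribute names of Hugging Face's BertModel (word_embeddings, position_embeddings, token_type_embeddings), which are implementation details rather than part of the public API:

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
emb = model.embeddings

token_id = torch.tensor([[2023]])   # one token ID (batch of size 1)
position = torch.tensor([[0]])      # its position in the sequence
segment = torch.tensor([[0]])       # sentence A

# the three lookup tables: 30522 x 768, 512 x 768 and 2 x 768 for BERT BASE
token_emb = emb.word_embeddings(token_id)
position_emb = emb.position_embeddings(position)
segment_emb = emb.token_type_embeddings(segment)

# the input embedding is their sum (BERT then applies LayerNorm and dropout)
input_emb = token_emb + position_emb + segment_emb
print(input_emb.shape)  # torch.Size([1, 1, 768])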

Feature Extraction from Text

BERT can be used out-of-the-box (i.e. at inference) to extract machine-readable data representations from text. Once this is done, applying traditional machine learning or deep learning techniques such as classification or regression becomes a straightforward process.

To create embeddings, we start by adding special tokens to the input text. This preprocessing step is required because the Transformer model was trained on data that follows this organization.

In the code snippet below, we manually add the [CLS] and [SEP] tokens to mark the start of the sequence and the end of the sentence, respectively. After that, we split the sentence into words or word pieces using the tokenize method of a pre-trained BERT tokenizer. Finally, we append [PAD] tokens until we reach a fixed size equal to a pre-set maximum sentence length.

from transformers import BertModel, AutoTokenizer
import torch

# load "bert-base-cased" the pre-trained model
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states = False)

# load the corresponding WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


# example sentence
text = "Let’s deep dive into BERT."

# add [CLS] and [SEP] tokens (note: str.replace does not interpret regex patterns)
text = "[CLS] " + text.replace(".", " [SEP]")

# get tokens from the sentence
tokens = tokenizer.tokenize(text=text)

# set the maximum length of a sentence
max_length = 30

# add [PAD] tokens to shorter sentences
padded_tokens = tokens + ['[PAD]' for _ in range(max_length-len(tokens))]

# get token IDs
token_ids = tokenizer.convert_tokens_to_ids(padded_tokens)

# generate the attention mask
attention_mask = [1 if token != '[PAD]' else 0 for token in padded_tokens]

# whether each token belongs to sentence A (0) or sentence B (1)
segment_ids = [0 for _ in range(len(padded_tokens))]

# convert lists to tensors
token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)
segment_ids = torch.tensor(segment_ids).unsqueeze(0)

To retrieve the input embeddings extracted from the token IDs:

# get input embeddings
input_embeddings = model.embeddings(token_ids, token_type_ids=segment_ids)

Two types of embeddings can be extracted from the input text.

  • Token-level embeddings: these are the embeddings produced by default in BERT's last hidden state, one embedding per input token. Token-level embeddings are used for tasks such as question answering or named entity recognition.
  • Sequence-level embeddings: these require a pooling post-processing step on top of the token-level embeddings, yielding one fixed-length embedding per sentence [3]. Pooling can be performed by taking the mean of the token embeddings (the most common choice), their max, or simply the output of the first token of each sequence (the [CLS] token output) [4]. Sequence-level embeddings are useful for text classification and sentiment analysis applications.
# set the model in evaluation mode
model.eval()

# get contextual embeddings
with torch.no_grad():
    # output of shape <batch_size, max_length, embedding_size>
    last_hidden_states = model(token_ids, attention_mask=attention_mask, token_type_ids=segment_ids)["last_hidden_state"]

# first token embedding of shape <1, hidden_size>
first_token_embedding = last_hidden_states[:,0,:]

# pooled embedding of shape <1, hidden_size>
mean_pooled_embedding = last_hidden_states.mean(axis=1)
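Note that the mean pooling above also averages over [PAD] positions. A common refinement, sketched below with the same variable names as above, is to mask out the padding before pooling:

# mask-aware mean pooling: exclude [PAD] positions from the average
mask = attention_mask.unsqueeze(-1).float()        # <batch_size, max_length, 1>
summed = (last_hidden_states * mask).sum(dim=1)    # sum over real tokens only
counts = mask.sum(dim=1).clamp(min=1)              # number of real tokens
masked_mean_embedding = summed / counts            # <batch_size, hidden_size>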

In this section, we showed in detail how to generate text embeddings using BERT. For demonstration purposes, we manually added the special tokens to our text. However, this pre-processing step can be automated with the Hugging Face transformers library by using the encode_plus function:

# encode a raw sentence (special tokens are added automatically)
sentence = "Let’s deep dive into BERT."
encoded = tokenizer.encode_plus(
    text=sentence,
    add_special_tokens=True,     # add [CLS] and [SEP] tokens
    max_length=30,               # set the maximum length of a sentence
    truncation=True,             # truncate longer sentences to max_length
    padding='max_length',        # add [PAD] tokens to shorter sentences
    return_attention_mask=True,  # generate the attention mask
    return_tensors='pt',         # return encoding results as PyTorch tensors
)

# get the token IDs and attention mask
token_ids = encoded['input_ids']
attention_mask = encoded['attention_mask']
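These tensors can then be passed to the model loaded earlier exactly as before, for example:

# forward pass with the automatically prepared inputs
with torch.no_grad():
    last_hidden_states = model(token_ids, attention_mask=attention_mask)["last_hidden_state"]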

Fine-tuning for downstream tasks

As mentioned earlier, BERT can be used out-of-the-box as a feature extractor to generate text embeddings. However, to achieve the best performance, it should be fine-tuned on the target dataset.

Fine-tuning is a transfer learning technique that solves a target problem by adapting a model trained on a different source task or dataset. In the context of NLP applications, the models used have already been trained on large text corpora, so they have a general understanding of language and capture complex semantic relationships in text. Previous research has shown that fine-tuning existing LLMs yields better results with less data and in less time than training a new neural network such as a Transformer from scratch [5].

For task-specific fine-tuning, we add a trainable fully connected layer on top of the architecture; only a few epochs are needed to optimize it and obtain the desired result.

Hereafter, we briefly present fine-tuning strategies for the three most common use cases: sequence classification, token classification, and question answering; a minimal code sketch follows the list.

  • Sequence classification: we simply add a classification layer that takes as input the sequence-level embedding and outputs the class label.
    It can be used on a single text sequence (e.g. sentiment analysis and topic classification), or on a pair of text sequences (e.g. natural language inference and semantic textual similarity). The difference between the two cases is shown in Figures 3 and 4.
Figure 3: Fine-tuning BERT for sequence classification. As pooled sequence-level embedding, we take the embedding of the [CLS] token.
Figure 4: Fine-tuning BERT for sequence-pair classification.
  • Token classification: the added classification layer takes the token embeddings as input and outputs each token's class label (cf. Figure 5). Typical applications of token classification are named entity recognition and part-of-speech tagging.
Figure 5: Fine-tuning BERT for token classification.
  • Question answering: the model takes as input two text sequences, where the first one is the question and the second one is the passage that contains the answer and its context (cf. Figure 6). The answer is a segment (text span) of this passage. The output layer learns, for each token in the passage, the probability of being the start or the end index of the answer. The predicted answer is the span whose start and end tokens have the largest start and end probabilities, respectively.
Figure 6: Fine-tuning BERT for question-answering
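As an illustration, the Hugging Face transformers library already provides BERT with such task-specific heads (BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering). The sketch below sets up sequence classification on a toy, hypothetical batch of texts and labels; it is a minimal outline of a single optimization step, not a full training loop:

import torch
from transformers import BertForSequenceClassification, AutoTokenizer

# BERT encoder + a randomly initialized classification layer on top of the pooled [CLS] output
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# toy batch (hypothetical data)
texts = ["I loved this movie.", "This was a waste of time."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, max_length=30, return_tensors="pt")

# when labels are provided, the model also returns the classification loss
outputs = model(**batch, labels=labels)
loss = outputs.loss

# a single optimization step; in practice a few epochs over the dataset suffice
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss.backward()
optimizer.step()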

BERT variants

Several BERT variations exist. Hereafter, we summarize some examples.

RoBERTa is a widely used BERT-like model. It is pretrained on a larger corpus and applies dynamic masking at training time, instead of the static masking performed once during preprocessing in the original BERT. Several pretrained RoBERTa-based models are available, such as CamemBERT for French, RuBERTa for Russian, and XLM-RoBERTa for multilingual text.

DistilBERT [6] and DistilRoBERTa are applications of knowledge distillation to BERT models, i.e. a compression technique that effectively reproduces the behavior of a large model in a smaller, and hence faster and cheaper, one. This makes them more suitable for deployment on edge devices (e.g. mobile phones).

Sentence-BERT (sBERT) [7] uses Siamese and triplet network structures to alleviate the need to pass two sentences through the model for every semantic similarity evaluation. BERTopic [8] is used for topic modeling tasks. Broadly, it uses sBERT to embed documents and clusters semantically similar ones together. Given its modularity, BERTopic can also use document-embedding language models other than sBERT to generate the embeddings.
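For completeness, here is a minimal sketch of computing sentence embeddings and their cosine similarity with the sentence-transformers library from [7]; the checkpoint name 'all-MiniLM-L6-v2' is just one commonly available example:

from sentence_transformers import SentenceTransformer, util

# load a pre-trained sBERT-style model (example checkpoint)
sbert = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["BERT produces contextual token embeddings.",
             "Sentence-BERT maps whole sentences to a dense vector space."]

# one fixed-size embedding per sentence
embeddings = sbert.encode(sentences)

# cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))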

Key Takeaways

BERT uses the encoder part of the Transformer architecture to generate contextualized embeddings. The bidirectional pre-training allows it to capture both the left and right context of words for language understanding. In general, encoder-based models are used for predictive modeling tasks like text classification.

Other popular LLMs for related problems are decoder-based models like GPT and encoder-decoder models like BART. Decoder-based models are optimized for next-word prediction using causal (autoregressive) attention, which makes them better suited to text generation tasks. The third category, which combines the encoder and decoder parts, takes a more generalist approach and seeks to combine the advantages of both.

Thank you for reading!

References

[1] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[2] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).

[3] Choi, Hyunjin, et al. “Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks.” 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.

[4] Reimers, Nils, and Iryna Gurevych. “Sentence-BERT: Sentence embeddings using siamese BERT-networks.” arXiv preprint arXiv:1908.10084 (2019).

[5] Chris McCormick and Nick Ryan. “BERT Fine-Tuning Tutorial with PyTorch” (July 22, 2019), http://www.mccormickml.com

[6] Sanh, Victor, et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” arXiv preprint arXiv:1910.01108 (2019).

[7] Reimers, Nils, and Iryna Gurevych. “Sentence-BERT: Sentence embeddings using siamese BERT-networks.” arXiv preprint arXiv:1908.10084 (2019).
Python library: https://www.sbert.net/

[8] Grootendorst, Maarten. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure.” arXiv preprint arXiv:2203.05794 (2022).
Python library: https://maartengr.github.io/BERTopic/index.html

[9] Khalid, Usama, Mirza Beg, and Muhammad Arshad. “RUBERT: A Bilingual Roman Urdu BERT Using Cross Lingual Transfer Learning.” (2021).
