Fine-tuning BERT model for arbitrarily long texts, Part 1

Michał Brzozowski
Published in MIM Solutions Blog
7 min read · Jan 23, 2024

This is part 1 of our series about fine-tuning BERT:

  • if you want to read the second part, go to this link,
  • and if you want to use the code, go to our GitHub.

Models based on the transformer architecture have become the state-of-the-art solution in NLP. The word “transformer” is indeed what the letter “T” stands for in the names of the famous BERT, GPT-3, and the now massively popular ChatGPT. A common obstacle when applying these models is the constraint on the input length. For example, the BERT model cannot process texts longer than 512 tokens (roughly speaking, one token corresponds to one word).

A method to overcome this issue was proposed by Devlin (one of the authors of BERT) in a discussion on the BERT GitHub repository. In this article, we will describe in detail how to modify the process of fine-tuning a pre-trained BERT model for the classification task. The code is available as open source here.

Overview of BERT classification

Let us start with the description of the three stages in the life of the BERT classifier model:

  • Model pre-training.
  • Model fine-tuning.
  • Application.

Model pre-training

In the first stage, BERT is pre-trained on a large corpus of data in a self-supervised fashion. That is, the training data consists of raw texts only, without human labeling. The model is trained with two objectives: guessing a masked word in a sentence and predicting whether one sentence comes after another.

Observe that both of these tasks are concerned only with separate sentences and not the entire document. Hence there is no need to truncate longer texts. The entire book In Search of Lost Time can be used during pre-training despite having more than 1 200 000 words; it is simply processed sentence by sentence.

We can load the pre-trained base BERT model using the transformers library:
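A minimal sketch of this step (the bert-base-uncased checkpoint and the sequence-classification head are assumptions; the original code may use a different checkpoint):

```python
from transformers import AutoModelForSequenceClassification

# Pre-trained BERT with a freshly initialized binary-classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
```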

The warning informs us that the downloaded model must be fine-tuned on the downstream task (in our case, this will be a binary classification of sequences). This step will be described in the following subsection.

We use a similar approach to get the tokenizer:
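Again a sketch, assuming the same bert-base-uncased checkpoint:

```python
from transformers import AutoTokenizer

# The matching tokenizer; its model_max_length attribute is 512.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)  # 512
```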

Note the parameter model_max_length=512 listed above. It is the main obstacle we will work around in this article. Applying the model without modification simply truncates every text to 512 tokens, and all the information and context in the rest of the document is discarded during the fine-tuning and prediction stages.

The most straightforward and natural idea is to divide the text into smaller chunks and feed them to the model separately. This is our strategy; however, as we will see, the devil is in the details.

Model fine-tuning

Clearly, after reading many books and the entire Wikipedia, the downloaded pre-trained model is knowledgeable. However, its knowledge is very general.

Assume that we only need to predict whether a movie review is positive or negative; the model’s vast and intricate wisdom of quantum mechanics and Proust is of little use here. We need to adapt the model to our specific task of binary sequence classification: recognizing whether a review is positive or negative based on its text.

To do this, we use the supervised learning approach. More precisely, we prepare a training set of reviews manually labeled as positive or negative and feed it to the model with an additional classification layer on top.
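As an illustration, here is a minimal sketch of one step of this standard (truncating) fine-tuning; the review texts and labels are made up, and the optimizer loop is omitted:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical labeled examples for the binary sentiment task.
train_texts = ["A wonderful, moving film.", "Two hours of my life wasted."]
train_labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# The classification head on top of BERT turns the pooled representation
# into two logits; passing labels makes the model return a loss as well.
batch = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=train_labels).loss
loss.backward()  # gradients for one step of standard fine-tuning
```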

Modifying the fine-tuning step so that it looks at the entire text and not just the first 512 tokens turns out to be non-trivial and will be described in detail later.

Model application

The last stage is applying the trained model to the new data and obtaining classifications.

Using the fine-tuned classifier on longer texts

It will be instructive to first describe the simpler process of modifying an already fine-tuned BERT classifier so that it can be applied to longer texts. This section is mainly based on the excellent tutorial article:

The main difference between our approaches here is allowing the chunks of text to overlap.

Finding a long review

In what follows, we will consider the well-known dataset of movie reviews from IMDB. We are interested in classifying them based on their sentiment, that is, whether they are positive or negative.

After basic exploration, we load the dataset from Hugging Face and find a very long review of David Lynch’s Mulholland Drive:
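A sketch of this step; picking the longest review in the training split is our assumption about how the example was found, and the article’s example review has 2278 words:

```python
from datasets import load_dataset

# Load the IMDB movie-review dataset from the Hugging Face hub.
dataset = load_dataset("imdb", split="train")

# Take the longest review (in words) as the working example.
longest_review = max(dataset["text"], key=lambda text: len(text.split()))
print(len(longest_review.split()))
```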

As we can see, the review is rather elaborate and consists of 2278 words. We want to split it into chunks small enough to fit into the 512-token limit of the BERT input.

Loading the already fine-tuned BERT classifier

In this section, we will assume that we have an already fine-tuned BERT classifier. Let us download one trained on the IMDB dataset from Hugging Face:
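For example (the exact checkpoint used in the article may differ; textattack/bert-base-uncased-imdb is one publicly available BERT classifier fine-tuned on IMDB):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "textattack/bert-base-uncased-imdb"  # assumed checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```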

Tokenization of the whole text

Now we want to tokenize the entire review:
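Roughly like this, with the flags discussed below:

```python
# Tokenize the whole review at once: no truncation, no special tokens yet.
tokens = tokenizer(
    longest_review,
    add_special_tokens=False,
    truncation=False,
    return_tensors="pt",
)
print(tokens["input_ids"].shape)  # (1, 3155) for the article's example review
```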

Observe the following:

  • We set add_special_tokens to False, because we will add special tokens at the beginning and the end manually after the splitting procedure.
  • We set truncation to False, because we do not want to throw away any part of the text.
  • We set return_tensors to “pt” to get the result in the form of a PyTorch tensor.

The warning informs us that the tokenized sequence is too long (after tokenization we obtained 3155 tokens, which is significantly more than the number of words). If we simply put such a tensor into the model, it will not work.

Indeed, let us try it:
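A quick check (the exact error message depends on the transformers and PyTorch versions):

```python
import torch

# BERT's position embeddings only cover 512 positions,
# so a 3155-token sequence makes the forward pass fail.
try:
    with torch.no_grad():
        model(**tokens)
except Exception as err:
    print(type(err).__name__, err)
```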

What are the tokens?

Let us now take a look at what exactly these tokens are.
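The tokenizer output behaves like a dictionary of tensors:

```python
print(tokens.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

print(tokens["input_ids"][0][:10])  # the first ten token ids of the review
```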

As we can see, the tokenized text is equivalent to a Python dictionary with the following keys:

  • input_ids — this part is crucial — it encodes the words as integers. It can also contain some special tokens indicating the beginning (value 101) and the end of the text (value 102). We will add them manually after the splitting procedure.
  • token_type_ids — this binary tensor is used to separate question and answer in some specific applications of BERT. Because we are interested only in the classification task, we can ignore this part.
  • attention_mask — this binary tensor indicates the position of the padded indices. Later we will manually add zeroes there to make sure that all chunks have precisely the demanded size of 512.

Splitting the tokens

To fit the tokens into the model, we need to split them into chunks with a length of 512 tokens or less. However, we also need to put 2 special tokens at the beginning and the end; hence the upper bound is 510.

The procedure of splitting and pooling is determined by the hyperparameters. These are maximal_text_length, chunk_size, stride, minimal_chunk_length, and pooling_strategy.

They are used in the following way:

  • The parameter maximal_text_length is used to truncate the tokens. It can be either None, which means no truncation, or an integer, determining the number of tokens to consider. Standard BERT truncates to 510 tokens because it needs 2 additional tokens at the beginning and the end.
  • The integer parameter chunk_size determines the size (in tokens) of each chunk. This parameter cannot be larger than 510; otherwise, we would not be able to fit the chunk into the input of BERT.
  • Consecutive chunks may overlap, depending on the parameter stride.
  • In other words, we obtain chunks by moving a window of size chunk_size by a step equal to stride. The stride cannot be larger than the chunk size, so chunks either overlap or are adjacent to each other.
  • The stride has a meaning analogous to the stride in convolutional neural networks, and chunk_size is analogous to kernel_size in 1D CNNs.
  • We ignore chunks that are too small, i.e., shorter than minimal_chunk_length. This parameter cannot be set larger than chunk_size.
  • The string parameter pooling_strategy is used at the end to aggregate the model results. It can be either mean or max.

For clarity, we will demonstrate this procedure with a few examples:
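Below is a toy illustration of the splitting logic; the helper is our sketch, not necessarily the exact code from the repository:

```python
def split_overlapping(tensor, chunk_size, stride, minimal_chunk_length):
    """Split a sequence into (possibly overlapping) chunks of at most chunk_size."""
    chunks = [tensor[i : i + chunk_size] for i in range(0, len(tensor), stride)]
    # Ignore trailing chunks that are shorter than minimal_chunk_length.
    return [chunk for chunk in chunks if len(chunk) >= minimal_chunk_length]


example = list(range(1, 11))  # a toy "tokenized text" with 10 tokens

print(split_overlapping(example, chunk_size=5, stride=5, minimal_chunk_length=1))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]                 -> non-overlapping chunks
print(split_overlapping(example, chunk_size=5, stride=3, minimal_chunk_length=3))
# [[1, 2, 3, 4, 5], [4, 5, 6, 7, 8], [7, 8, 9, 10]]   -> overlapping chunks
```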

Adding special tokens

After splitting into smaller chunks we must add special tokens at the beginning and the end:
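In code this might look as follows (the helper name is ours; 101 and 102 are the ids of BERT’s [CLS] and [SEP] tokens). It operates on the lists of input_ids chunks and attention-mask chunks produced by the splitting step:

```python
import torch

def add_special_tokens_at_beginning_and_end(input_id_chunks, mask_chunks):
    """Prepend [CLS] (id 101) and append [SEP] (id 102) to every chunk."""
    for i in range(len(input_id_chunks)):
        input_id_chunks[i] = torch.cat(
            [torch.tensor([101]), input_id_chunks[i], torch.tensor([102])]
        )
        # The special tokens are real tokens, so the attention mask gets ones.
        mask_chunks[i] = torch.cat(
            [torch.tensor([1]), mask_chunks[i], torch.tensor([1])]
        )
```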

Next, we must add some padding tokens to ensure that all chunks have a size of precisely 512:
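Again a sketch, padding both the token ids and the attention mask with zeros on the right up to 512 entries:

```python
def add_padding_tokens(input_id_chunks, mask_chunks, chunk_size=512):
    """Pad every chunk on the right with zeros so that it has exactly chunk_size entries."""
    for i in range(len(input_id_chunks)):
        pad_len = chunk_size - input_id_chunks[i].shape[0]
        if pad_len > 0:
            input_id_chunks[i] = torch.cat(
                [input_id_chunks[i], torch.zeros(pad_len, dtype=torch.long)]
            )
            # Zeros in the attention mask tell BERT to ignore the padding.
            mask_chunks[i] = torch.cat(
                [mask_chunks[i], torch.zeros(pad_len, dtype=torch.long)]
            )
```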

Stacking the tensors

After applying this procedure to a single text, input_ids is a list of K tensors of size 512, where K is the number of chunks. To feed this into the BERT model, we must stack these K tensors into one tensor of size K x 512 and ensure that the tensor values have the appropriate type:
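A sketch of the stacking step:

```python
def stack_tokens_from_all_chunks(input_id_chunks, mask_chunks):
    """Stack K padded chunks into a single (K, 512) batch with the dtypes BERT expects."""
    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)
    return input_ids.long(), attention_mask.int()
```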

Wrapping it into one function

For convenience, we can wrap all the previous steps into a single function:
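Putting the helpers sketched above together (the function name and default values are our illustration):

```python
def transform_text_to_model_input(
    text, tokenizer, chunk_size=510, stride=510, minimal_chunk_length=1
):
    """Tokenize a long text, split it into chunks, add special tokens, pad and stack."""
    tokens = tokenizer(
        text, add_special_tokens=False, truncation=False, return_tensors="pt"
    )
    input_id_chunks = split_overlapping(
        tokens["input_ids"][0], chunk_size, stride, minimal_chunk_length
    )
    mask_chunks = split_overlapping(
        tokens["attention_mask"][0], chunk_size, stride, minimal_chunk_length
    )
    add_special_tokens_at_beginning_and_end(input_id_chunks, mask_chunks)
    add_padding_tokens(input_id_chunks, mask_chunks)
    return stack_tokens_from_all_chunks(input_id_chunks, mask_chunks)
```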

Procedure for the selected long review

Let us now combine all the mentioned steps for our example long review. We will use the parameters chunk_size=510, stride=510, and minimal_chunk_length=1, which means just splitting the text into non-overlapping parts:
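Using the function sketched in the previous subsection on the selected review:

```python
input_ids, attention_mask = transform_text_to_model_input(
    longest_review, tokenizer, chunk_size=510, stride=510, minimal_chunk_length=1
)
print(input_ids.shape)  # torch.Size([7, 512])
```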

Hence the review was divided into 7 chunks.

Using the fine-tuned model on the prepared data

The prepared data is ready to plug into our fine-tuned classifier:
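A sketch of the forward pass and of extracting per-chunk probabilities (which logit index means “positive” depends on the checkpoint; index 1 is assumed here):

```python
import torch

with torch.no_grad():
    # One forward pass over all chunks at once; logits has shape (K, 2).
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

# Probability that each chunk is positive (class index 1 assumed to be "positive").
probabilities = torch.nn.functional.softmax(logits, dim=-1)[:, 1]
print(probabilities)
```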

Let us summarize:

  • The fine-tuned model returned logit values for each chunk.
  • We applied the softmax function and slicing to get the probability that the review is positive.
  • We obtained a list of probabilities, one for each chunk: [0.9997, 0.9996, 0.5399, 0.9994, 0.9995, 0.9975, 0.9987].
  • Finally, we can apply some pooling function (mean or maximum) to obtain one aggregated probability for the entire review.
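For example, pooling the per-chunk probabilities obtained above:

```python
# Aggregate the per-chunk probabilities into one score for the whole review.
mean_probability = probabilities.mean()  # pooling_strategy = "mean"
max_probability = probabilities.max()    # pooling_strategy = "max"
print(mean_probability.item(), max_probability.item())
```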

Conclusions

In this part, we presented how to use an already fine-tuned BERT on arbitrarily long texts. However, what should we do when we want to fine-tune it ourselves? We answer this question in Part 2 of our series.
