Text Classification with BERT using Transformers for long text inputs

Dipansh Girdhar · Published in Analytics Vidhya · 8 min read · May 31, 2020
Bidirectional Encoder Representations from Transformers

Text classification has been one of the most popular topics in NLP and with the advancement of research in NLP over the last few years, we have seen some great methodologies to solve the problem. In this blog, we will solve a text classification problem using BERT (Bidirectional Encoder Representations from Transformers). We will use the Google Play app reviews dataset consisting of app reviews, tagged with either positive or negative sentiment — i.e., how a user or customer feels about the app.

We’ll learn how to fine-tune BERT for sentiment analysis after doing the required text preprocessing (special tokens, padding, and attention masks) and then building a Sentiment Classifier using the amazing Transformers library by Hugging Face!

WHAT IS BERT?

BERT (introduced in this paper) stands for Bidirectional Encoder Representations from Transformers. If you don’t know what most of that means — you’ve come to the right place! Let’s unpack the main ideas:

  • Bidirectional — to understand the text you’re looking at, you have to look back (at the previous words) and forward (at the next words)
  • Transformers — The Attention Is All You Need paper presented the Transformer model. The Transformer reads entire sequences of tokens at once. In a sense, the model is non-directional, while LSTMs read sequentially (left-to-right or right-to-left). The attention mechanism allows for learning contextual relations between words (e.g. his in a sentence refers to Jim).
  • (Pre-trained) contextualized word embeddings — The ELMO paper introduced a way to encode words based on their meaning/context. Nails has multiple meanings — fingernails and metal nails.

BERT was trained by masking 15% of the tokens with the goal to guess them. An additional objective was to predict the next sentence. Let’s look at examples of these tasks:

Masked LM (MLM)

The idea here is “simple”: randomly mask out 15% of the words in the input, replacing them with a [MASK] token, run the entire sequence through the BERT attention-based encoder, and then predict only the masked words, based on the context provided by the other, non-masked words in the sequence. However, there is a problem with this naive masking approach: the model only learns to make predictions when the [MASK] token is present in the input, while we want it to predict the correct tokens regardless of which tokens are present. To deal with this issue, out of the 15% of the tokens selected for masking:

  • 80% of the tokens are actually replaced with the token [MASK].
  • 10% of the time tokens are replaced with a random token.
  • 10% of the time tokens are left unchanged.

During training, the BERT loss function considers only the predictions for the masked tokens and ignores the predictions for the non-masked ones. This results in a model that converges much more slowly than left-to-right or right-to-left models.
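To make the 80/10/10 rule concrete, here is a minimal sketch of the token-selection logic. It is only an illustration of the idea, not the actual BERT pre-training code, and the toy vocabulary is an assumption:

```python
import random

def corrupt_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", vocab=None):
    """Illustrate BERT's masking strategy: for each token selected (~15% of the
    sequence), replace it with [MASK] 80% of the time, with a random token 10%
    of the time, and leave it unchanged 10% of the time."""
    vocab = vocab or ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (assumption)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:       # select ~15% of tokens
            labels[i] = token                 # the model must predict the original token here
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = mask_token             # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = random.choice(vocab)   # 10%: replace with a random token
            # remaining 10%: keep the token unchanged
    return corrupted, labels

print(corrupt_tokens("my dog is hairy and very friendly".split()))
```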

Next Sentence Prediction (NSP)

In order to understand the relationship between two sentences, the BERT training process also uses next sentence prediction. A pre-trained model with this kind of understanding is relevant for tasks like question answering. During training, the model receives pairs of sentences as input and learns to predict whether the second sentence is the next sentence in the original text.

As we have seen earlier, BERT separates sentences with a special [SEP] token. During training, the model is fed two input sentences at a time such that:

  • 50% of the time the second sentence comes after the first one.
  • 50% of the time it is a random sentence from the full corpus.

BERT is then required to predict whether the second sentence is random or not, with the assumption that the random sentence will be disconnected from the first sentence.

To predict whether the second sentence is connected to the first one, the complete input sequence goes through the Transformer-based model, the output of the [CLS] token is transformed into a 2×1 vector using a simple classification layer, and the IsNext label is assigned using softmax.

The model is trained with both Masked LM and Next Sentence Prediction together. This is to minimize the combined loss function of the two strategies — “together is better”.

BERT for long text

One of the limitations of BERT shows up when you have long inputs: the self-attention layer has quadratic complexity, O(n²), in the sequence length n (see this link). In this post, I followed the main ideas of this paper to overcome this limitation when you want to use BERT over long sequences of text.

Data Preprocessing

You might already know that Machine Learning models don’t work with raw text. You need to convert text to numbers (of some sort). BERT requires even more attention (of course!). Here are the requirements:

  • Add special tokens to separate sentences and do classification
  • Pass sequences of constant length (introduce padding)
  • Create an array of 0s (pad token) and 1s (real token) called attention mask

The Transformers library provides a wide variety of Transformer models (including BERT). It works with TensorFlow and PyTorch! It also includes prebuilt tokenisers that do the heavy lifting for us!

Let’s check our data.
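The original post embeds its code as gists; as a stand-in, here is a minimal sketch that loads the reviews into a DataFrame. The file name reviews.csv and the sentiment column name are assumptions (the post later refers to the text column as review_text):

```python
import pandas as pd

# Load the Google Play app reviews
# (file name "reviews.csv" and the "sentiment" column name are assumptions)
df = pd.read_csv("reviews.csv")

print(df.shape)                              # number of reviews and columns
print(df[["review_text", "sentiment"]].head())
```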

You can use either the cased or the uncased version of BERT and its tokenizer. I’ve experimented with both. The cased version works better. Intuitively, that makes sense, since “BAD” might convey more sentiment than “bad”.

Let’s load a pre-trained BertTokenizer:
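For example, using the cased version discussed above:

```python
from transformers import BertTokenizer

PRE_TRAINED_MODEL_NAME = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
```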

tokenizer.tokenize converts the text to tokens and tokenizer.convert_tokens_to_ids converts tokens to unique integers.
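A quick illustration (the sample sentence is just an example):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sample_text = "The app keeps crashing after the latest update."  # illustrative example
tokens = tokenizer.tokenize(sample_text)              # text -> word-piece tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)   # tokens -> integer ids

print(tokens)
print(token_ids)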

Some special tokens added by BERT are: [SEP], [CLS], [PAD].

BERT understands tokens that were in the training set. Everything else can be encoded using the [UNK] (unknown) token.
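To see the three preprocessing requirements from above (special tokens, constant-length padding, and the attention mask) handled in one place, the tokenizer can do all of it in a single call. A sketch, with max_length=160 chosen purely for illustration and argument names following the current Transformers API:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

encoding = tokenizer.encode_plus(
    "The app keeps crashing after the latest update.",  # illustrative example
    max_length=160,               # pad/truncate to a constant length (value chosen for illustration)
    padding="max_length",         # add [PAD] tokens up to max_length
    truncation=True,
    add_special_tokens=True,      # add [CLS] and [SEP]
    return_attention_mask=True,   # 1 for real tokens, 0 for padding
    return_tensors="pt",          # return PyTorch tensors
)

print(encoding["input_ids"].shape)        # torch.Size([1, 160])
print(encoding["attention_mask"][0][:20]) # mix of 1s (real tokens) and 0s (padding)
```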

To get a better understanding of the text preprocessing part and the code snippets for everything step by step, you can follow this amazing blog by Venelin Valkov.

Format the data for the BERT model

In this article, as the paper suggests, we are going to segment the input into smaller chunks and feed each of them into BERT. In other words, for each row we will split the text into several smaller pieces of around 200 words each. For example:

We split it into chunks of 200 words each, with an overlap of 50 words between consecutive chunks, for example:

The following function can be used to split the data:
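The original snippet lives in the repo; here is a minimal sketch of such a splitting function, using a chunk length of 200 words and a 50-word overlap (so the window advances by 150 words at a time). The name get_split is an assumption based on the description:

```python
def get_split(text, chunk_len=200, overlap=50):
    """Split a text into overlapping chunks of roughly `chunk_len` words."""
    words = text.split()
    if len(words) <= chunk_len:
        return [text]
    step = chunk_len - overlap   # the window advances by 150 words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_len]))
        if start + chunk_len >= len(words):
            break
    return chunks

# Toy usage: a 500-word "review" becomes three overlapping 200-word chunks
dummy_review = " ".join(f"word{i}" for i in range(500))
print([len(c.split()) for c in get_split(dummy_review)])  # [200, 200, 200]
```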

Applying this function to the review_text column of the dataset gives us a dataset where every row holds a list of strings of around 200 words each.

For every 200-word chunk, we extract a representation vector of size 768 from BERT.
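As an illustration of that step, here is a minimal sketch that runs one chunk through a pre-trained BERT model and takes the [CLS] representation as its 768-dimensional vector. The exact pooling used in the repo may differ, the placeholder text and max_length value are assumptions, and the attribute names follow the current Transformers API:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
model.eval()

chunk = "the first two hundred words of the review ..."  # one 200-word chunk (placeholder)
encoding = tokenizer(
    chunk,
    max_length=256,          # enough room for ~200 words plus word pieces (assumption)
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
    )

cls_vector = outputs.last_hidden_state[:, 0, :]  # the [CLS] token representation
print(cls_vector.shape)                          # torch.Size([1, 768])
```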

Sentiment Classification with BERT and Hugging Face

We have all the building blocks required to create a PyTorch dataset. Let’s go through the steps involved.

1. Preparing the text data to be used for classification:

This step involves specifying all the major inputs required by the BERT model: the review text, input_ids, attention_mask, and the targets.
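The full code is in the repo; as a stand-in, here is a minimal sketch of such a PyTorch Dataset. The class name GPReviewDataset and the max_len value are assumptions:

```python
import torch
from torch.utils.data import Dataset

class GPReviewDataset(Dataset):
    """Wraps reviews and targets, returning the inputs BERT needs."""

    def __init__(self, reviews, targets, tokenizer, max_len=160):
        self.reviews = reviews
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        review = str(self.reviews[idx])
        encoding = self.tokenizer.encode_plus(
            review,
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            add_special_tokens=True,
            return_attention_mask=True,
            return_tensors="pt",
        )
        return {
            "review_text": review,
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "targets": torch.tensor(self.targets[idx], dtype=torch.long),
        }
```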

2. Splitting the data into train and test:

It is always better to split the data into train and test sets so that the model can be evaluated on unseen data at the end.
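A minimal sketch using scikit-learn; the split ratio, random seed, and the toy frame standing in for the reviews DataFrame are all assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the reviews DataFrame loaded earlier (assumption)
df = pd.DataFrame({
    "review_text": ["great app", "keeps crashing", "love it", "waste of time"],
    "sentiment": [1, 0, 1, 0],
})

df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)
print(df_train.shape, df_test.shape)
```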

3. Loading the data

We also need to create a couple of data loaders, and a helper function makes this convenient. A sample data loader function can look like this:
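A sketch of such a helper, building on the GPReviewDataset class sketched in step 1 (the function name and column names are assumptions):

```python
from torch.utils.data import DataLoader

def create_data_loader(df, tokenizer, max_len, batch_size):
    # GPReviewDataset is the Dataset class sketched in step 1 above
    ds = GPReviewDataset(
        reviews=df["review_text"].to_numpy(),
        targets=df["sentiment"].to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len,
    )
    return DataLoader(ds, batch_size=batch_size, num_workers=2)

# Example usage (values are illustrative):
# train_data_loader = create_data_loader(df_train, tokenizer, max_len=160, batch_size=16)
# test_data_loader = create_data_loader(df_test, tokenizer, max_len=160, batch_size=16)
```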

4. Loading the BERT Model

There are a lot of helpers that make using BERT easy with the Transformers library. Depending on the task you might want to use BertForSequenceClassification, BertForQuestionAnswering or something else. We’ll use the basic BertModel and build our sentiment classifier on top of it. Let’s load the model:
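Loading the base model is a one-liner (bert-base-cased, matching the cased tokenizer above):

```python
from transformers import BertModel

bert_model = BertModel.from_pretrained("bert-base-cased")
```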

5. Create the Sentiment Classifier model, which adds a single new layer on top of BERT that will be trained to adapt BERT to our task. Please refer to the SentimentClassifier class in my GitHub repo.
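The actual class lives in the repo; here is a minimal sketch of what such a classifier typically looks like, i.e. a dropout layer and a linear layer on top of BERT's pooled output. Details such as the dropout rate are assumptions:

```python
import torch.nn as nn
from transformers import BertModel

class SentimentClassifier(nn.Module):
    def __init__(self, n_classes, pre_trained_model_name="bert-base-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pre_trained_model_name)
        self.drop = nn.Dropout(p=0.3)  # dropout rate is an assumption
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)  # 768 -> n_classes

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output  # [CLS]-based pooled representation
        return self.out(self.drop(pooled_output))
```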

6. Train the model

To reproduce the training procedure from the BERT paper, we’ll use the AdamW optimizer provided by Hugging Face. It corrects weight decay, so it’s similar to the original paper.

The BERT authors have some recommendations for fine-tuning:

  • Batch size: 16, 32
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 2, 3, 4

Note that increasing the batch size reduces the training time significantly, but gives you lower accuracy.
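Putting these recommendations together, a minimal training-setup sketch could look like the following. The post uses the AdamW implementation shipped with Transformers; this sketch uses PyTorch's own AdamW, which serves the same purpose, and the concrete choices (batch size 16, learning rate 2e-5, 4 epochs) are just one combination from the list above. SentimentClassifier and train_data_loader refer to the earlier sketches:

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

EPOCHS = 4
LEARNING_RATE = 2e-5   # one of the recommended values

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SentimentClassifier(n_classes=2).to(device)   # classifier sketched in step 5

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
total_steps = len(train_data_loader) * EPOCHS          # train_data_loader from step 3
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps,
)
loss_fn = torch.nn.CrossEntropyLoss().to(device)

for epoch in range(EPOCHS):
    model.train()
    for batch in train_data_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        targets = batch["targets"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs, targets)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```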

7. Evaluating the results

We have achieved an accuracy of almost 90% with basic fine-tuning.
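A minimal sketch of the evaluation loop that produces such an accuracy number, reusing the names from the sketches above (test_data_loader is an assumption):

```python
import torch

def eval_accuracy(model, data_loader, device):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            targets = batch["targets"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs, dim=1)
            correct += (preds == targets).sum().item()
            total += targets.size(0)
    return correct / total

# test_acc = eval_accuracy(model, test_data_loader, device)
```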

Final Words

Techniques for classifying long documents usually require truncating or padding the input to a shorter text; however, as we saw, by splitting the text into smaller overlapping chunks and feeding them through BERT, we can still achieve such tasks.

In the paper, another method has been proposed: ToBERT (transformer over BERT). This is something I’ll probably try in the future.

The complete code can be found on this GitHub repository. The project isn’t complete yet, so, I’ll be making modifications and adding more components to it. Feel free to raise an issue or a pull request if you need my help.

Thanks for reading. Cheers!
