Evolution of NLP — Part 4 — Transformers — BERT, XLNet, RoBERTa

Using SOTA Transformers models for Sentiment Classification

Kanishk Jain · Published in Analytics Vidhya · Dec 20, 2020


This is the endgame! Transformers are one of the premier deep learning architectures in use today, combined with transfer learning to handle various NLP tasks. We continue our journey to find the best solution for our Sentiment Classification task. We’ve already seen basic techniques — TF-IDF and Bag of Words — then LSTMs, then Transfer Learning combined with LSTMs, and now we explore an altogether different architecture!

Before we move ahead, it’s important to get a sense of issues with RNNs/LSTMs.

Issues with LSTMs

LSTMs have been critical in the evolution of NLP. They were some of the most pivotal architectures that resolved issues within RNNs and made deep learning in NLP more widespread. But they pose some issues that make them difficult to work with.

  1. Slow to train! RNNs are already slow to train because data has to be fed in sequentially rather than in parallel; a hidden state needs the outputs from all the previous words before it can make any progress. This kind of architecture doesn’t take advantage of today’s GPUs, which are optimized for parallel processing. Couple this with the added complexity of an LSTM’s multiple gates, and they become even slower to train.
  2. Limited contextual awareness! Normal LSTMs process words in only one direction, which limits the contextual awareness of the network. Even Bi-Directional LSTMs learn the forward and backward contexts separately and then concatenate them; a better approach would be to look at the words before and after a position together!
  3. Long sequences! We already know that LSTMs perform better than RNNs thanks to their gates, but even so, the improvement is not significant for tasks that crunch large amounts of text, for example Text Summarization and Question Answering.

With Transformers, let’s see how we can address these issues.

Transformers

This architecture was first introduced in the Attention Is All You Need (2017) paper. The complete architecture can be split into two parts — Encoder & Decoder. In the attached image, the structure on the left is the Encoder and the one on the right is the Decoder.

Encoder-Decoder architecture from Attention is All You Need! (2017) Paper

1. Encoder

In a nutshell, an Encoder tries to understand the context of an input sentence, and in the process learns what the language is! Let’s see how data moves through an encoder.

  • The input to an Encoder is a sentence — say The big red dog. These individual words are first tokenized and replaced by pre-trained word embeddings in the Input Embeddings layer.
  • In the next step, we add a Positional Encoding to each word vector — essentially a function that maps to the position of the word in the sentence. We add this to our initial input embeddings so that, apart from the word itself, its position is also embedded in what we pass into the encoder.
  • Next, we pass this to a Self-Attention layer. The idea here is to create a matrix of how important each word is for defining the context of every other word in the same sentence. Each word obviously has significant importance with itself, and then some importance with the other words (see the short sketch after this list).
The darker the red, the higher the attention each individual word has with all the other words. — Image from Transformers Explained — https://www.youtube.com/watch?v=TQQlZhbC5ps
  • And, finally, we pass it through a Feed-Forward layer, which is basically a dense neural network layer.
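To make the “matrix of importance” idea concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. This is purely illustrative and not code from the article — the function name, weight matrices, and dimensions are made up for the example.

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) word embeddings with positional encodings already added
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # queries, keys, values
    scores = q @ k.T / (k.shape[-1] ** 0.5)      # (seq_len, seq_len) "importance" matrix
    weights = F.softmax(scores, dim=-1)          # each row: attention over all the words
    return weights @ v                           # context-aware word representations

# Illustrative usage with random weights for a 4-word sentence ("The big red dog")
d_model = 8
x = torch.randn(4, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)           # shape: (4, 8)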

The biggest advantage of this network is that the data doesn’t need to be passed sequentially — it can be processed in parallel! This greatly improves training time. And since we look at all the words together, without any directional sense, the overall understanding of language and context is much better than in other architectures.

To understand BERT and XLNet, you only need to know Encoders. So, if that is your goal, please jump to the section on BERT. I’ll give explaining the Decoders a shot next.

2. Decoder

In a nutshell, a Decoder takes the output sentence as input and tries to predict the next word. Visualize the whole Transformer architecture as a Language Translation model — the input is “The big red dog” in English and the output is “Le gros chien rouge” in French. Now let’s see how this output flows through the decoder network.

  • The first step is similar to Encoder. Embedding, coupled with Positional Encoding for each output word.
  • Next, we compute Masked Self-Attention. This is slightly different from before: here we intentionally mask the words that are yet to come, i.e. we don’t provide the words that are supposed to come next while calculating the importance scores. This is different from what we did in the Encoder. Intuitively, while predicting the next French word we can use the context of all the English words, but we can’t use the next French word itself, because then the model would simply copy it. I know it’s a little confusing, but it will become clearer as we see how these models are used in practice (a small illustration of the masking follows this list).
  • Then, the decoder’s representations are looked at alongside the Encoder’s output (encoder-decoder attention). Here, we try to establish the context or relationship between the English and French words together, but again with masking for the French words.
  • Finally, as the words travel through the Feed-Forward and Linear layers, we reach the Softmax. The softmax is taken over all the words in the vocabulary, which gives us a probability score for every word in the dictionary as the next word. The highest-probability word is chosen for each position.
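Here is a tiny, purely illustrative snippet (not from the article) showing how the masking works: scores for future positions are set to minus infinity before the softmax, so those words receive zero attention weight.

import torch

seq_len = 4                                                # e.g. "Le gros chien rouge"
scores = torch.randn(seq_len, seq_len)                     # raw self-attention scores
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))           # hide the words yet to come
weights = torch.softmax(scores, dim=-1)                    # future positions get weight 0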

Do note that, again, this process happens in parallel, similar to the Encoder.

I hope this gives you some insight into the architecture of Transformers, and how they can be an improvement over LSTMs. Let’s try and look at two such architectures & their implementation in detail.

BERT

BERT stands for Bidirectional Encoder Representations from Transformers. As the name implies, this architecture uses the Encoder part of the Transformer network, but with multiple encoder blocks stacked one after the other. Another important aspect that makes BERT stand out is its training methodology. Let’s try to understand that.

1. Pre-Training

This is the phase where the model learns what language and context are. For this part, BERT trains on two tasks simultaneously —

  • Masked Language Modeling — Intuitively, this is like a “fill in the blanks” learning task. The model randomly masks some portion of the sentence and its job is to predict those masked words. For example, with the input “The [MASK] brown fox [MASK] over the lazy dog.”, the output would be [“quick”, “jumped”]. (A quick hands-on illustration follows this list.)
  • Next Sentence Prediction — Here, BERT takes in two sentences and determines if the second sentence follows the first one, basically like a Classification Task.
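Just to make the Masked LM task tangible, here is a quick way to play with it using the HuggingFace pipeline API. This is an aside rather than part of the training code we build below, and 'bert-base-uncased' is an assumed checkpoint.

from transformers import pipeline

# BERT fills in the blank for us — the core idea behind Masked Language Modeling
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
fill_mask("The quick brown fox [MASK] over the lazy dog.")
# Returns the top candidate tokens for the masked position with their scores,
# e.g. "jumped", "jumps", "leaps", ...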

In a nutshell, these two tasks give BERT an understanding of language and context both within a sentence and over multiple sentences.

2. Fine-Tuning

Now that the model basically understands what language and context are, the next step is to train it for our specific task — in our case, Sentiment Classification. We simply add Dense layers at the end of the network to get the output required for our specific task.

Let’s start with implementing BERT. We’ll use the familiar fast.ai constructs; however, since these models are not directly available in fast.ai, we’ll have to create wrapper classes for the Tokenizer, Vocabulary, and Numericalizer so that they work with fast.ai. This can be exceptionally confusing, especially if you are not familiar with these libraries.

Note that the implementation below is inspired by this excellent Kernel Tutorial and a Medium blog by the same author. It not only allows you to implement transformers quickly, but also gives tremendous flexibility in trying out other powerful architectures.

That said, I’ll try to share my interpretation of this tutorial below. Feel free to check out the kernel for better clarity. There are other tutorials on the same topic, which I’ll link below, but if you want a general solution for loading multiple types of Transformer models, I would recommend reading this implementation.

Apart from the fast.ai library, which we learned about in the last tutorial, here we’ll additionally use the HuggingFace Transformers library. This library has almost all the major SOTA NLP models, with task-specific models for Question Answering, Text Summarization, Masked LM, etc. We’ll be using the Sequence Classification models here, but feel free to try out the models for other tasks.

Now, running any model using the HuggingFace library requires you to load 3 components —

  1. Model Class — This will help us load the architecture and pre-trained weights for our specific model.
  2. Tokenizer Class — This will help pre-process the data into tokens. Note that the padding, the start and end tokens of sentences, and the handling of words missing from the vocabulary differ somewhat across models.
  3. Config Class — This is the configuration class to store the configuration of the chosen model. It is used to instantiate the model according to the specified arguments, defining the model architecture.

Note that we need to load all three classes for the same model. For example, in our initial trial, since we’re running BERT, we load BertForSequenceClassification as the model class, BertTokenizer as the tokenizer class, and BertConfig as the config class.
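A sketch of what this looks like in code — the referenced kernel uses a dictionary to map model types to their classes, but for BERT alone it boils down to something like this ('bert-base-uncased' is an assumed checkpoint name):

from transformers import BertForSequenceClassification, BertTokenizer, BertConfig

model_type = 'bert'
pretrained_model_name = 'bert-base-uncased'

# the three classes we will use throughout the rest of this tutorial
model_class, tokenizer_class, config_class = BertForSequenceClassification, BertTokenizer, BertConfig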

The next step is Tokenization!

Tokenization

Note that BERT has its own vocabulary and tokenizer. Hence, we need to create a wrapper around BERT’s internal implementation so that it is compatible with fast.ai’s. This can be done in the 3 steps shown below.

  1. First, we create a tokenizer object, by loading the default tokenizer for our specific model.
transformer_tokenizer = tokenizer_class.from_pretrained(pretrained_model_name)
Tokenizer output — Image from Author
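For instance (an illustrative example, not the article’s exact output), the pre-trained BERT tokenizer lowercases the text and splits it into WordPiece tokens:

transformer_tokenizer.tokenize('This game is great!')
# -> ['this', 'game', 'is', 'great', '!']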

2. We then create our own BertBaseTokenizer class, where we override the tokenizer function, incorporating the special tokens and length limits needed for our specific set of transformer models.

from typing import List
from fastai.text import BaseTokenizer
from transformers import PreTrainedTokenizer

class BertBaseTokenizer(BaseTokenizer):
    """Wrapper around PreTrainedTokenizer to be compatible with fast.ai"""
    def __init__(self, pretrained_tokenizer: PreTrainedTokenizer, model_type='bert', **kwargs):
        self._pretrained_tokenizer = pretrained_tokenizer
        self.max_seq_len = pretrained_tokenizer.max_len   # max sequence length (model_max_length in newer transformers versions)
        self.model_type = model_type

    def __call__(self, *args, **kwargs):
        return self

    def tokenizer(self, t: str) -> List[str]:
        """Limits the maximum sequence length and adds the special tokens"""
        CLS = self._pretrained_tokenizer.cls_token        # [CLS] for BERT
        SEP = self._pretrained_tokenizer.sep_token        # [SEP] for BERT
        tokens = self._pretrained_tokenizer.tokenize(t)[:self.max_seq_len - 2]
        return [CLS] + tokens + [SEP]

And we initialize our base tokenizer using this —

bert_base_tokenizer = BertBaseTokenizer(pretrained_tokenizer = transformer_tokenizer, model_type = model_type)

3. We are not done yet, and this is the most confusing part. We pass our bert_base_tokenizer into fast.ai’s Tokenizer class, which is what fast.ai then processes. This additional step is important, so make sure you do it in your implementation as well.

fastai_tokenizer = Tokenizer(tok_func = bert_base_tokenizer)

4. To use this while loading data, it is recommended to convert it into a TokenizeProcessor, which we can pass during the DataBunch call, as we saw in the earlier tutorial.

tokenize_processor = TokenizeProcessor(tokenizer=fastai_tokenizer, include_bos=False, include_eos=False)

Numericalization

This step involves encoding the tokens into numerical IDs. Once again, each transformer model has its own vocabulary and hence its very own encoding. Let’s load them into the familiar fast.ai framework as well.

  1. First, let’s create our own version of fast.ai’s Vocab class, overriding its functions — numericalize (converts tokens to encodings) and textify (converts encodings to tokens). In the definitions of these functions, we use the convert_tokens_to_ids and convert_ids_to_tokens functions respectively, which work with HuggingFace’s pre-trained transformer models. (A sketch of this class is included after the code below.)
TransformersVocab class definition — Image from Author

2. Finally, we pass this into a NumericalizeProcessor class, similar to the TokenizeProcessor, which we will call during our DataBunch creation.

transformer_vocab = TransformersVocab(tokenizer = transformer_tokenizer)
numericalize_processor = NumericalizeProcessor(vocab=transformer_vocab)
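Since the class definition above appears only as an image, here is a sketch of what TransformersVocab likely looks like, based on the description — a minimal version; the original kernel may differ in its details:

import numpy as np
from fastai.text import Vocab
from transformers import PreTrainedTokenizer

class TransformersVocab(Vocab):
    """Wrapper so fast.ai uses the transformer's own token <-> id mapping."""
    def __init__(self, tokenizer: PreTrainedTokenizer):
        super(TransformersVocab, self).__init__(itos=[])
        self.tokenizer = tokenizer

    def numericalize(self, tokens):
        # tokens -> ids, using the pre-trained model's vocabulary
        return self.tokenizer.convert_tokens_to_ids(tokens)

    def textify(self, nums, sep=' '):
        # ids -> tokens, joined back into a string
        nums = np.array(nums).tolist()
        return sep.join(self.tokenizer.convert_ids_to_tokens(nums))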

Loading the Data

Next, we load the data using the DataBlock API.
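The transformer_processor passed below combines the two processors we just built; in the referenced kernel this is simply a list of the two (tokenize first, then numericalize):

transformer_processor = [tokenize_processor, numericalize_processor]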

databunch = (TextList.from_df(train, cols='user_review', processor=transformer_processor)
             .split_by_rand_pct(0.1, seed=seed)
             .label_from_df(cols='user_suggestion')
             .add_test(test)
             .databunch(bs=bs, pad_first=pad_first, pad_idx=pad_idx))

Modeling

Before we start the modeling process, we create our own model using the pre-trained model as the base, extracting only the logits needed for our specific prediction.

Custom Classification Model — Image from Author
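Since the model definition above is shown only as an image, here is a minimal sketch of what such a wrapper might look like, reusing the model_class / config_class / pretrained_model_name names from the loading step earlier; the kernel’s exact version may differ in its details:

import torch.nn as nn
from transformers import PreTrainedModel

config = config_class.from_pretrained(pretrained_model_name)
config.num_labels = 2                                      # two sentiment classes

transformer_model = model_class.from_pretrained(pretrained_model_name, config=config)

class CustomTransformerModel(nn.Module):
    """Thin wrapper that keeps only the logits from the HuggingFace model's output."""
    def __init__(self, transformer_model: PreTrainedModel):
        super(CustomTransformerModel, self).__init__()
        self.transformer = transformer_model

    def forward(self, input_ids, attention_mask=None):
        # the sequence-classification models return the logits as their first output
        logits = self.transformer(input_ids, attention_mask=attention_mask)[0]
        return logits

custom_transformer_model = CustomTransformerModel(transformer_model=transformer_model)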

And finally, we initialize the AdamW optimizer that comes packaged with the HuggingFace Transformers library.

Learner — Image from Author
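The Learner setup is likewise shown as an image, so here is a sketch of what it likely contains; correct_bias=False mirrors the behaviour of BERT’s original optimizer, and the exact arguments are assumptions rather than the kernel’s verbatim code:

from functools import partial
from fastai.text import Learner, accuracy
from transformers import AdamW

# HuggingFace's AdamW, wrapped so fast.ai can use it as an optimizer function
CustomAdamW = partial(AdamW, correct_bias=False)

learner = Learner(databunch,
                  custom_transformer_model,
                  opt_func=CustomAdamW,
                  metrics=[accuracy])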

After this, we could run the model directly, but it is worth introducing another useful tool available in the fast.ai library first.

Gradual Unfreezing

In our previous tutorials, we covered Discriminative Fine-Tuning and Slanted Triangular Learning Rates. Gradual Unfreezing is another technique you can explore while building models.

The idea is quite simple — most of the earlier layers of models like BERT are there for understanding the language, and just like pre-trained CNN models, they can be left untouched; simply tuning the last few layers can lead to good results pretty quickly.

For models available within fast.ai, like AWD-LSTM, the layers already come in groups, which can be made un-trainable. For BERT, we’ll have to create the groups on our own.

list_layers = [learner.model.transformer.bert.embeddings,
learner.model.transformer.bert.encoder.layer[0],
learner.model.transformer.bert.encoder.layer[1],
learner.model.transformer.bert.encoder.layer[2],
learner.model.transformer.bert.encoder.layer[3],
learner.model.transformer.bert.encoder.layer[4],
learner.model.transformer.bert.encoder.layer[5],
learner.model.transformer.bert.encoder.layer[6],
learner.model.transformer.bert.encoder.layer[7],
learner.model.transformer.bert.encoder.layer[8],
learner.model.transformer.bert.encoder.layer[9],
learner.model.transformer.bert.encoder.layer[10],
learner.model.transformer.bert.encoder.layer[11],
learner.model.transformer.bert.pooler]

These layers can be split into individual groups -

learner.split(list_layers)

Now, to freeze all the layers, except for the last one, we use -

learner.freeze_to(-1)

And, let the model train for 1 epoch.

After this, we sequentially unfreeze another layer —

learner.freeze_to(-2)

And, let the model train for another epoch.

And finally, we unfreeze all the layers and let it run for 5 epochs.
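Putting the whole schedule together, it looks roughly like this (the fit_one_cycle learning-rate slices are illustrative placeholders, not the kernel’s exact values):

learner.freeze_to(-1)                                   # train only the last layer group
learner.fit_one_cycle(1, max_lr=slice(1e-5, 1e-4))

learner.freeze_to(-2)                                   # unfreeze one more group
learner.fit_one_cycle(1, max_lr=slice(1e-5, 1e-4))

learner.unfreeze()                                      # train everything
learner.fit_one_cycle(5, max_lr=slice(1e-6, 1e-4))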

Accuracy — 92%

This gave you a glimpse of using BERT for Sentiment Analysis. But the HuggingFace library is not limited to BERT. Several newer models have come up and shown improvements over BERT.

XLNet

XLNet shot to fame after it beat BERT on roughly 20 NLP tasks, sometimes by quite substantial margins. So, what is XLNet and how is it different from BERT? XLNet has an architecture similar to BERT’s. However, the major difference comes in its approach to pre-training.

  • BERT is an Autoencoding (AE) based model, while XLNet is Auto-Regressive (AR). This difference materializes in the MLM task, where randomly masked tokens must be predicted by the model. To better understand the difference, let’s consider a concrete example: [New, York, is, a, city].
  • Suppose both BERT and XLNet select the two tokens [New, York] as the prediction targets and maximize log p(New, York | is, a, city). Also suppose that XLNet samples the factorization order [is, a, city, New, York]. In this case, BERT and XLNet respectively reduce to the following objective functions:

J_BERT = log p(New | is, a, city) + log p(York | is, a, city)

J_XLNet = log p(New | is, a, city) + log p(York | New, is, a, city)

  • Notice that XLNet is able to capture the dependency between the pair (New, York), which is omitted by BERT. Although BERT learns some dependency pairs in this example, such as (New, city) and (York, city), XLNet always learns more dependency pairs given the same targets and thus contains “denser” effective training signals.

Let’s see for our specific task if XLNet offers any significant improvement.

Overall, the implementation is fairly similar, with minor changes in the padding and the start and end tokens of the sentences. You can study the code in this kernel.

Accuracy — 93%

RoBERTa

Robustly optimized BERT approach — RoBERTa — is a retraining of BERT with an improved training methodology, roughly 10 times more data, and more compute power.

To improve the training procedure, RoBERTa removes the Next Sentence Prediction (NSP) task from BERT’s pre-training and introduces dynamic masking, so that the masked tokens change across training epochs. Larger training batch sizes were also found to be useful.

Importantly, RoBERTa uses 160 GB of text for pre-training, including 16GB of Books Corpus and English Wikipedia used in BERT.

Overall, the implementation is fairly similar, with minor changes in the padding and the start and end tokens of the sentences. You can study the code in this kernel.

Accuracy — 94%

With a few lines of code, we were able to implement and study SOTA Transformer Models. I hope this series served as a good starter for you to learn these techniques.

With these new innovations, what can be done in NLP has changed significantly over the last few years. These Transformer models are on their way to potentially replacing all existing LSTM/RNN based models! And they are not slowing down — just recently, OpenAI’s GPT-3 was released, which is also based on the Transformer architecture, but built using Decoders. It’s important to stay up-to-date with these new models and architectures, and to continue learning along the way!

See you in the next blog! Happy Learning :)
