Distil-RoBERTa for Hate Speech Classification and a Conceptual Review of Transformers

Hector Andres Mejia Vallejo
Published in Analytics Vidhya · Aug 5, 2021

Hello to the people who like to read nerdy tech stuff, and to those with ML projects due next week, haha. I realized from my past two posts that I enjoy writing about Artificial Intelligence, as it always means learning something new, or at the very least refreshing some concepts while presenting information to the community. This time I wanted to write something about hate speech because of Pride Month, but I got caught up in my work and my graduation project, so here it is, almost two months late anyway.

This article is about the task of text classification. Specifically, we are going to use the “🤗” Datasets library to get a Twitter dataset for hate speech classification and the Transformers library to fine-tune a Distil-RoBERTa model for this use case, so let’s go!

If you want to skip straight to the code, you can check out my Colab notebook or scroll down until you see code.

RoBERTa is a Transformer model. Photo by Arseny Togulev on Unsplash

What is a Transformer?

No, I am not talking about Autobots or Decepticons. Transformers are state-of-the-art deep learning architectures that revolutionized natural language processing. Their predecessors were based on complex recurrent or convolutional neural networks for sequence modeling and transduction, which made them difficult and slow to train. Transformers instead rely solely on attention mechanisms [1], which will be explained later in this article. For now, think of attention as a black box that receives a key, a value and a query.

The Transformer architecture. Source [1]

The architecture follows an encoder-decoder model. The encoder is a stack of N identical blocks, each consisting of:

  • A Multi-Head Attention layer
  • A position-wise fully connected feed-forward layer
  • Residual connections around each of these sub-layers, followed by layer normalization.

On the other hand, the decoder is also a stack of N blocks with the following layers:

  • A masked Multi-Head Attention layer that receives the output embeddings shifted right, with a special start token and a special end token. We use these because we are trying to predict the next token given the previous ones. The masking is applied so that a token at position i in a sequence can only attend to tokens at positions less than i when making its prediction.
  • Another Multi-Head Attention layer that takes its values and keys from the encoder output and its queries from the previous attention layer.
  • A good ol’ position-wise fully connected layer.
  • These sub-layers also have residual connections around them.

After the decoder, a linear layer with softmax activation is employed to produce the probabilities for the next token.

What is a Multi-Head Attention Layer?

The concept of attention is the key to extracting context information from a text sequence. Now, think of a sentence that breaks into n word tokens.

Attention at the human level. Each word is a token.

What do we see in this figure? If we look at the “played” token, we can ask ourselves: where should the attention go? Well, we might say that “football” gets some attention, since that is what we were playing. Then “friends” would receive some attention, as these are the people we played with, and so on. Each arrow is an attention mechanism. We can see that attention is split across different entities. That is roughly the purpose of multi-head attention layers.

Now, these tokens are converted into numerical vectors of hidden dimension d, called embeddings, resulting in an n x d matrix that represents the text sequence.

Multi-Head attention architecture. Source [1]

That same matrix is fed to three linear layers, as seen in the picture. You can see there is “V” for values, “K” for keys and “Q” for queries, but in self-attention all three projections start from the same input matrix. The queries and keys are combined by matrix multiplication, scaled, and fed to a softmax function, giving us “weights” that are then multiplied with the values. The point of all this is that words that need attention get more weight, hopefully adding context information to the original embeddings. Recall our previous figure, where the sentence had many arrows representing attention mechanisms: each of those would be an attention head in our model, and all the heads work in parallel, effectively capturing different kinds of information.
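To make this concrete, below is a minimal NumPy sketch of scaled dot-product attention with two heads. The matrix sizes, random projection weights and number of heads are purely illustrative; a real Transformer learns these projections and applies one more linear layer after concatenating the heads.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    # X is the (n x d) matrix of token embeddings for one sequence.
    # Each head projects the same X into queries, keys and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Compare every query with every key, scale, and turn into weights.
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # shape (n x n)
    # Weighted sum of values: each token gets a context-aware vector.
    return weights @ V  # shape (n x d_k)

# Toy example: 4 tokens, embedding size 8, head size 4 (illustrative numbers).
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 4
X = rng.normal(size=(n, d))
heads = [
    attention_head(X, rng.normal(size=(d, d_k)),
                   rng.normal(size=(d, d_k)),
                   rng.normal(size=(d, d_k)))
    for _ in range(2)
]
# Multi-head attention concatenates the heads; a final linear layer would follow.
multi_head = np.concatenate(heads, axis=-1)  # shape (n x 2*d_k)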

BERT and RoBERTa transformers

The objective of transformers is to produce language models that can create numerical representations from text using a pre-training methodology like the one we saw earlier, where each token could only rely on context information from previous tokens, and that can then be fine-tuned for some downstream task like text classification, text summarization, etc.

BERT stands for Bidirectional Encoder Representations from Transformers [2]. The authors argued that previous transformers were restricting the power of pre-trained text representations by using only left-to-right training.

With BERT, however, they proposed a methodology called masked language modeling (MLM). It simply masks 15% of the tokens at random. The objective is to let the model learn bidirectional context information: the model has to predict the original tokens behind the masks, trained with a cross-entropy loss.

In addition to MLM, the model is further trained for Next Sentence Prediction (NSP). The objective is to learn relationships between two sentences. To do that, they used pairs of sentences (like <question, answer>) to generate token sequences, with a special token separating the two sentences. When choosing sentences A and B for each pre-training example, 50% of the time B is the actual sentence that follows A, labeled IsNext; and 50% of the time it is a random sentence from the corpus, labeled NotNext.
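As a rough illustration (and not part of the fine-tuning code later in this article), the sketch below builds a BERT-style sentence pair and masks 15% of the non-special tokens at random. The example sentences, the “bert-base-uncased” checkpoint and the masking loop are assumptions for demonstration only; the real pre-training pipeline is more involved.

import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# NSP-style input: sentence A and sentence B joined with special tokens.
encoded = tokenizer("I played football with my friends.", "We had a great time.")
input_ids = encoded["input_ids"]

# MLM: mask 15% of the non-special tokens; the model is trained to recover
# the original tokens with a cross-entropy loss.
special = {tokenizer.cls_token_id, tokenizer.sep_token_id}
candidates = [i for i, t in enumerate(input_ids) if t not in special]
for i in random.sample(candidates, k=max(1, int(0.15 * len(candidates)))):
    input_ids[i] = tokenizer.mask_token_id

print(tokenizer.convert_ids_to_tokens(input_ids))
# The NSP label would be IsNext or NotNext depending on whether B truly follows A.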

The architecture of BERT comes in two sizes:

  • 12 transformer blocks for the base model, 24 for the large model
  • 12 self-attention heads within each attention layer for base, 16 for large
  • Embeddings with a hidden size of 768 for base, 1024 for large

The overall advantage of this approach, as the authors point out, is that BERT can generate more powerful representations than previous models and needs only minimal architecture changes to be fine-tuned for most NLP tasks.

BERT can be easily fine tuned with minimal changes in architecture for most NLP tasks. Source [2]

Now, RoBERTa is actually based on BERT, with a few changes [3]:

  • Training the model longer, with bigger batches, over more data
  • Removing the next sentence prediction objective
  • Training on longer sequences
  • Dynamically changing the masking pattern applied to the training data

They found that BERT’s original training procedure was suboptimal, and that these changes to pre-training improve its performance. For comparison, you can see the results in the following table:

Table of comparisons. Last three columns correspond to three different datasets. Source [3]

Hate Speech Classification

Let’s start with the actual implementation. We are going to use “🤗” Datasets library. It holds many datasets for us to train and test our models. Check it out here if you want. For now we are going to use “tweets_hate_speech_detection” for our particular use case.

We can use the following snippet to list all the dataset names available:
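The original snippet is not shown here, but a minimal version would look roughly like this (filtering on “hate” is only there to peek at related dataset names):

from datasets import list_datasets

# List every dataset available on the Hub and peek at a few names.
all_datasets = list_datasets()
print(f"{len(all_datasets)} datasets available")
print([name for name in all_datasets if "hate" in name][:5])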

Then, we load our desired dataset and split it in an 80–20 fashion for training and testing like so:
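A sketch of that loading and splitting step, assuming the dataset ships only a single train split (the seed is arbitrary):

from datasets import load_dataset

# The dataset only provides a "train" split, so we carve out 20% for testing.
raw = load_dataset("tweets_hate_speech_detection", split="train")
splits = raw.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(train_ds)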

Now we want to know the distribution of labels, and plot it with a little help from matplotlib.
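A minimal version of that counting and plotting code, assuming label 0 marks normal tweets and label 1 marks hate speech:

from collections import Counter
import matplotlib.pyplot as plt

# Count how many tweets fall into each class in the training split.
counts = Counter(train_ds["label"])
plt.bar(["normal", "hate speech"], [counts[0], counts[1]])
plt.title("Label distribution in the training split")
plt.ylabel("Number of tweets")
plt.show()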

This is the bar chart we get:

It can be seen clearly that the number of “normal” samples is far greater than the number of “hate speech” samples. This indicates that we might need precision and recall metrics, since accuracy by itself can be misleading when working with unbalanced data. More on that later!

Now let us instantiate the pretrained tokenizer for the RoBERTa model. This kind of tokenizer transforms our text sequences into numerical vectors, and we map it over our train and test datasets.
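A sketch of that step, assuming the text column of the dataset is called “tweet” and using the “distilroberta-base” checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

def tokenize(batch):
    # Turn raw tweet text into token ids, padding/truncating to a fixed length.
    return tokenizer(batch["tweet"], padding="max_length", truncation=True)

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)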

And here is our dear RoBERTa. We pass the number of classes for our training as an argument; it could be more than two for other use cases, of course. RoBERTa will add context information to the sequence vectors we produced earlier with our tokenizer and perform text classification in an end-to-end fashion.
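Instantiating it could look like this, with the classification head configured for our two labels:

from transformers import AutoModelForSequenceClassification

# Distil-RoBERTa body with a freshly initialized classification head on top.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=2
)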

Next, we need to actually train our model. The “🤗” Transformers library provides a Trainer object that facilitates model training with little coding on our side. But first, we need to define the hyperparameters to use. Note that we are not going to perform hyperparameter optimization here; I will leave that up to you guys. For now, let’s just use the ones below.
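A possible set of training arguments; apart from the five epochs, which match the evaluation output further below, the values are illustrative rather than the exact ones used:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,               # matches the 5 epochs in the results below
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_steps=100,
)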

Moreover, we can pass the trainer a function that will retrieve the performance metrics we need. Why do we need precision and recall? Let’s say that our model predicts every sample as normal speech at testing time. You would say “what a useless model”. But, in fact, that model would get about 93% accuracy, which is simply the proportion of samples labeled as “normal”.

Retrieved from https://en.wikipedia.org/wiki/Precision_and_recall. Yeah sue me!

From the picture above, we can see that precision tells us how many of our positive predictions are truly positive. Fewer false positives means higher precision. On the other hand, recall tells us how many of all the examples that belong to the positive class we actually managed to predict as positive. Fewer false negatives means higher recall.

In addition, we can also include the F1-score, which is nothing more than the harmonic mean of the former two metrics. The following code snippet shows the definition of the function that will compute them:
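A version of that function, using scikit-learn to compute the scores for the positive (hate speech) class:

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Precision, recall and F1 are reported for the positive (hate speech) class.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }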

Now we pass the training args, the model, the metrics function and the datasets to the trainer, and perform training and evaluation afterwards. Beware not to use it with other kinds of neural networks, since this object is optimized only for Transformers!
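Putting it all together could look like this:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

trainer.train()
metrics = trainer.evaluate()
print(metrics)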

In the end, we got the following performance metrics:

{'epoch': 5.0,  
'eval_accuracy': 0.9816958698372966,
'eval_f1': 0.8612099644128114,
'eval_loss': 0.12132309377193451,
'eval_precision': 0.9075,
'eval_recall': 0.8194130925507901,
'eval_runtime': 100.8981,
'eval_samples_per_second': 63.351,
'eval_steps_per_second': 3.964
}

Not too bad, eh? But recall is not as high as we would want. Still, I think we can fit our model better by optimizing its hyperparameters, but that is a story for another day.

Epilogue

If you reached all the way down here, I appreciate you taking the time to read this article. I hope it can be useful to you! I do this to learn and to try to contribute to the community as a sign of gratitude. Also, let’s end hate speech! We should all be equal in rights, and our differences are what make this world beautiful.

See ya!

References

[1] A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Dec. 2017 [Online]. Available: http://arxiv.org/abs/1706.03762. [Accessed: 05-Aug-2021]

[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], May 2019 [Online]. Available: http://arxiv.org/abs/1810.04805. [Accessed: 05-Aug-2021]

[3] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv:1907.11692 [cs], Jul. 2019 [Online]. Available: http://arxiv.org/abs/1907.11692. [Accessed: 05-Aug-2021]
