NLP: Zero To Hero [Part 3: Transformer-Based Models & Conclusion]

Mar 23, 2023

Link to Part 1 of this article:
NLP: Zero To Hero [Part 1: Introduction, BOW, TF-IDF & Word2Vec]
Link to Part 2 of this article:
NLP: Zero To Hero [Part 2: Vanilla RNN, LSTM, GRU & Bi-Directional LSTM]
Link to the Colab File:
https://github.com/PrateekCoder/NLP_Zero_To_Hero

This article continues NLP: Zero To Hero Part 1 and Part 2. In Part 1, we covered text pre-processing and feature extraction methods, and built sentiment analysis models using SVM with different vectorizers: BOW, TF-IDF, and Word2Vec. In Part 2, we built sentiment analysis models using Vanilla RNN, LSTM, GRU, and Bi-Directional LSTM. In this article, we will use pre-trained transformer-based models for the same sentiment analysis task and compare their performance with all the previous models. They say the Transformer is the best we have so far, so let's put it to the test.

Transformers

Transformers are a type of neural network architecture that has revolutionized natural language processing (NLP) tasks such as language modeling, machine translation, and sentiment analysis. The transformer architecture was introduced in the paper “Attention is All You Need” by Vaswani et al. in 2017.

The key innovation of the transformer architecture is the self-attention mechanism, which allows the model to weigh the importance of different words in the input sequence when making predictions. Self-attention lets every position attend directly to every other position, so the model can capture long-range dependencies without passing information step by step through a hidden state, as RNN models must.
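To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification: each vector acts as its own query, key, and value, whereas a real transformer layer first applies learned projection matrices and uses several attention heads. The toy vectors are made-up numbers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Each vector serves as its own query, key and value; a real layer
    would first multiply by learned matrices W_Q, W_K and W_V.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Output is the attention-weighted average of all value vectors.
        mixed = [sum(w * value[i] for w, value in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional "word" vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence, and each row sums to 1. Because every token scores every other token directly, distance within the sequence does not matter.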

The original transformer architecture consists of an encoder and a decoder, each made up of multiple layers of attention and feedforward neural networks. The encoder takes in the input sequence and outputs a sequence of hidden states. The decoder applies self-attention over the target tokens generated so far, uses encoder-decoder (cross) attention to attend to the encoder’s hidden states, and generates the output sequence one token at a time.

The transformer architecture underlies several pre-trained language models such as BERT (encoder only), GPT-2 (decoder only), and T5 (encoder and decoder), which achieved state-of-the-art performance on various NLP tasks when they were released. These pre-trained models can be fine-tuned on a specific downstream task with relatively little labeled data, making them highly versatile and widely used in industry and academia.

If you want to learn about Transformers, read this research paper:
https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
If, like me, you find the paper hard to read and want a simpler explanation, go through this article by Jay Alammar:
https://jalammar.github.io/illustrated-transformer/

How do Transformer Based Models Work?

Transformer-based models like DistilBERT or RoBERTa are designed to process sequential data, such as natural language text, and extract meaningful features that can be used for various NLP tasks such as sentiment analysis, text classification, and question answering. They rely on a large number of parameters that are pre-trained on massive amounts of text in a self-supervised manner (no human-written labels are needed), which allows them to learn general patterns and semantic relationships between words and phrases.

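During the pre-training phase, the model learns from the text itself. BERT-family models such as DistilBERT and RoBERTa are trained with masked language modeling: some words in a sentence are hidden and the model must predict them from the surrounding context. The original BERT additionally learns to classify whether one sentence follows another (RoBERTa drops this objective), while GPT-style models learn to predict the next word. In all cases this is done using a self-attention mechanism, which allows the model to focus on different parts of the input text and capture the context and relationships between words.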
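BERT-family models such as DistilBERT and RoBERTa are pre-trained with masked language modeling: some tokens are hidden and the model has to recover them from context. The sketch below shows only how such a training example could be built from a sentence. The function name, the whitespace tokenization, and the masking probability are illustrative assumptions. BERT's actual recipe is more involved: it masks about 15% of sub-word tokens and sometimes substitutes a random token or keeps the original instead of inserting [MASK].

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens and remember the originals.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the word the model would be trained to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

# Toy sentence; a higher mask_prob than usual so a short example shows masks.
sentence = "the movie was surprisingly good and the acting was great".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(targets)
```

No labels are needed for this objective, because the hidden words themselves serve as the targets. That is why pre-training can use very large amounts of raw text.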

After pre-training, the model can be fine-tuned on a specific task with a smaller labeled dataset. During fine-tuning, the model’s parameters are adjusted to improve its performance on the specific task. For example, in sentiment analysis, the model learns to map the input text to a positive or negative sentiment label.
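For a two-class sentiment task, the fine-tuned model outputs one raw score (logit) per class, and the prediction is the class with the highest softmax probability. The following sketch illustrates that last step only. The label ordering and the logit values are made up for illustration.

```python
import math

LABELS = ["negative", "positive"]  # assumed ordering: index 0 = negative

def predict_sentiment(logits):
    """Turn two raw class scores into (label, probability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, confidence = predict_sentiment([-1.2, 2.3])  # made-up logits
print(label, round(confidence, 3))
```

Fine-tuning adjusts the model's weights so that, for each training review, the logit of the correct class ends up larger than the other one.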

Overall, transformer-based models like DistilBERT or RoBERTa have demonstrated state-of-the-art performance on a variety of NLP tasks, making them a popular choice for many NLP applications.

Sentiment Analysis model using Pre-Trained DistilBERT Model

I will be using a pre-trained Transformer-based model called DistilBERT from the Hugging Face transformers library.

Here are the steps we will follow to build our model:

Import necessary packages and libraries, including TensorFlow, transformers, and NumPy.
Connect to TPU and create a distribution strategy using tf.distribute.TPUStrategy().
Set the name of the pre-trained model to use (distilbert-base-uncased) and create an AutoTokenizer object from the transformers library.
Set the maximum length of the input sequences to 512.
Split the data into training and test sets using train_test_split() from scikit-learn (the validation set is carved out of the training data in a later step).
Tokenize the text data using the tokenizer object and convert the output into a TensorFlow tensor using tf.convert_to_tensor(). Convert the sentiment labels to a NumPy array.
Further split the training data into training and validation sets using train_test_split() again.
Create TensorFlow datasets for training and validation using tf.data.Dataset.from_tensor_slices().
Load the model using TFAutoModelForSequenceClassification.from_pretrained() and compile it with the Adam optimizer, the SparseCategoricalCrossentropy loss function, and accuracy as the evaluation metric.
Set up early stopping to prevent overfitting during model training.
Train the model using the fit() method on the training and validation datasets, with the defined batch size, number of epochs, and early stopping callback.
Tokenize the test data using the tokenizer object and convert the output into a TensorFlow tensor using tf.convert_to_tensor(). Convert the sentiment labels to a NumPy array.
Create a TensorFlow dataset for the test data.
Predict the labels for the test dataset using the trained model and calculate the accuracy.
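The steps above can be sketched roughly as follows. This is not the notebook's code (see the Colab link at the top for that) but a minimal illustration. It assumes a transformers release that still ships the TensorFlow classes. The function names, split sizes, batch size, learning rate, patience, and epoch count are all assumptions, and it splits the raw texts first and tokenizes each split afterwards, a small reordering of the steps listed above.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    matches = sum(int(p == a) for p, a in zip(predicted, actual))
    return matches / len(actual)


def train_and_evaluate(texts, labels, model_name="distilbert-base-uncased",
                       max_length=512, batch_size=32, epochs=5, use_tpu=False):
    """Fine-tune a pre-trained classifier and return its test accuracy.

    Labels are expected to be integers (0 = negative, 1 = positive).
    """
    # Heavy imports live here so the helper above works without them.
    import numpy as np
    import tensorflow as tf
    from sklearn.model_selection import train_test_split
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

    # Connect to a TPU when asked to; otherwise use the default strategy.
    if use_tpu:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.TPUStrategy(resolver)
    else:
        strategy = tf.distribute.get_strategy()

    # Hold out a test set, then carve a validation set out of the rest.
    # The split sizes are assumptions, not values from the article.
    x_train, x_test, y_train, y_test = train_test_split(
        list(texts), list(labels), test_size=0.2, random_state=42)
    x_train, x_val, y_train, y_val = train_test_split(
        x_train, y_train, test_size=0.1, random_state=42)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def make_dataset(split_texts, split_labels):
        # Pad or truncate every review to max_length tokens.
        encodings = tokenizer(split_texts, padding="max_length", truncation=True,
                              max_length=max_length, return_tensors="tf")
        return tf.data.Dataset.from_tensor_slices(
            (dict(encodings), np.array(split_labels))).batch(batch_size)

    train_ds = make_dataset(x_train, y_train)
    val_ds = make_dataset(x_val, y_val)
    test_ds = make_dataset(x_test, y_test)

    with strategy.scope():
        model = TFAutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=2)
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumed rate
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"])

    # Stop when validation loss stops improving (patience is an assumption).
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=2, restore_best_weights=True)
    model.fit(train_ds, validation_data=val_ds, epochs=epochs,
              callbacks=[early_stopping])

    # The model returns raw logits; the predicted class is the argmax.
    logits = model.predict(test_ds)["logits"]
    predicted = np.argmax(logits, axis=1).tolist()
    return accuracy(predicted, y_test)
```

With a list of review strings and a matching list of 0/1 labels, a call such as train_and_evaluate(reviews, sentiments) would run the whole pipeline. Passing model_name="roberta-base" swaps in RoBERTa without other changes, since the Auto classes pick the matching tokenizer and architecture.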

Sentiment Analysis Model using DistilBERT Model

Accuracy of DistilBERT Model on Raw Text Data

Accuracy of pre-trained DistilBERT Model on raw text data: 92.00%
Accuracy of pre-trained DistilBERT Model on pre-processed data: 85.00%

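When I trained the above model on pre-processed data, it reached an accuracy of 85%; on the raw text data with early stopping, it reached 92%. One plausible reason is that the pre-trained tokenizer and model were built for natural, unprocessed text, so aggressive cleaning can remove context the model knows how to use.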

Sentiment Analysis model using Pre-Trained RoBERTa Model

We follow the same steps as for the DistilBERT model, using the roberta-base model instead of distilbert-base-uncased.

Sentiment Analysis Model using Pre-Trained RoBERTa Model

Accuracy of RoBERTa Model on Raw Text Data

Accuracy of pre-trained RoBERTa Model on raw text data: 95.00%

Note: When I first ran the model on raw text without early stopping, my accuracy dropped sharply after 6 epochs of training, from 99% to almost 50%. After reducing the learning rate and applying early stopping, I was able to achieve 95% accuracy in 5 epochs.
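The logic behind early stopping can be sketched in a few lines: keep track of the best validation score so far and stop once it has failed to improve for a set number of epochs. The function name, the patience value, and the validation curve below are made up for illustration. Keras's EarlyStopping callback monitors validation loss by default rather than accuracy.

```python
def best_epoch_with_early_stopping(val_accuracies, patience=2):
    """Return (best_epoch, stopped_epoch), both 1-indexed.

    Training stops once validation accuracy has failed to improve
    for `patience` consecutive epochs.
    """
    best_accuracy = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_accuracies)

# Made-up validation curve that collapses after epoch 5.
history = [0.90, 0.93, 0.94, 0.945, 0.95, 0.70, 0.52, 0.50]
print(best_epoch_with_early_stopping(history, patience=2))  # -> (5, 7)
```

With this curve, training halts at epoch 7 and the weights from epoch 5 would be kept, so the later collapse never reaches the final model.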

Comparing the Accuracy of all the models for our Sentiment Analysis:

Accuracy of SVM with BOW: 88.04%
Accuracy of SVM with TF-IDF: 90.04%
Accuracy of SVM with Custom Word2Vec: 49.92%
Accuracy of SVM with Google Word2Vec: 85.74%
Accuracy of Vanilla RNN Model: 85.64%
Accuracy of LSTM Model: 86.40%
Accuracy of GRU Model: 85.60%
Accuracy of Bi-Directional LSTM Model: 86.74%
Accuracy of DistilBERT Model: 92.00%
Accuracy of RoBERTa Model: 95.00%
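To see the ordering at a glance, the reported accuracies can be ranked with a few lines of Python. The numbers are copied from the list above; the shortened model names are just labels chosen for this snippet.

```python
# Accuracies (in percent) as reported in the three parts of this series.
accuracies = {
    "SVM + BOW": 88.04,
    "SVM + TF-IDF": 90.04,
    "SVM + custom Word2Vec": 49.92,
    "SVM + Google Word2Vec": 85.74,
    "Vanilla RNN": 85.64,
    "LSTM": 86.40,
    "GRU": 85.60,
    "Bi-Directional LSTM": 86.74,
    "DistilBERT": 92.00,
    "RoBERTa": 95.00,
}

# Sort from best to worst and print a simple leaderboard.
ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank:>2}. {model:<24} {score:6.2f}%")
```

The two transformer models come first, followed by SVM with TF-IDF and SVM with BOW, then the recurrent models and SVM with Google Word2Vec within about a point of each other, and SVM with custom Word2Vec last.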

Based on these results, traditional machine learning models still perform very well for text classification tasks like sentiment analysis: SVM with TF-IDF reached 90.04% and SVM with BOW reached 88.04%, both ahead of every recurrent model we trained. The recurrent deep learning models (Vanilla RNN, LSTM, GRU, and Bi-Directional LSTM) landed between 85.60% and 86.74%. The Transformer-based models performed best, with DistilBERT at 92% and RoBERTa at 95%.

It is important to note that the choice of model for a particular task depends on several factors, including the size of the dataset, the complexity of the problem, the amount of labeled data available, and the computational resources available. Thus, the performance of a model should be evaluated in the context of these factors.

Overall, the study highlights the importance of evaluating multiple models and selecting the best-performing model for a particular task.