Using a BERT Model for Sentiment Analysis

Mirza Yusuf
6 min read · Apr 4, 2022


Exploring the basics of BERT on the Kaggle dataset: Natural Language Processing with Disaster Tweets

Introduction

Text classification is one of the most interesting and widely used areas of NLP (Natural Language Processing) today. Sentiment analysis, a type of text classification, is the process of extracting qualities such as emotion or sentiment from text. It is extremely useful in social media monitoring, as it allows us to gain an overview of the wider public opinion behind certain topics.

The dataset that we are going to work with in this article is called Natural Language Processing with Disaster Tweets. This tutorial walks through building a basic BERT model, which is ideal for beginners. Using basic text tokenization and a pre-trained model, we achieve an accuracy of around 82%. The complete code can be found in this Colab notebook.

Note: This is an introduction to implementation, and the article will not explain BERT as a model itself. If you are new and want to understand the theory behind the model, this link is a good place to get started.

Dataset

The dataset used here is from a popular Kaggle challenge for beginners called Natural Language Processing with Disaster Tweets, which is essentially a set of tweets related to disasters. There are around 7614 unique data points in the training dataset and 3264 in the test set. The training dataset is composed of 5 columns:

  1. id: unique token to identify each data point.
  2. text: the text of the tweet.
  3. keyword: an important keyword from the tweet.
  4. location: the location the tweet was sent from.
  5. target: the label of the tweet, where ‘1’ indicates a real disaster and ‘0’ indicates otherwise.

Let us take a peek at the data and mark out the important parts that we need for the model.

Looking at the dataset before building the model is very important: one can gain insight into which parts of the data are useful and which are not. It is clear from the above example that the relevant columns are text and target; the rest contribute little to nothing to the task at hand, so we can drop them.
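To make this concrete, here is a minimal sketch of how the data can be loaded and trimmed down with pandas. The file paths are assumed to be the default names from the Kaggle download; the notebook itself may organize this differently.

```python
import pandas as pd

# Load the Kaggle CSVs (paths assume the default competition download).
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(train_df.head())   # peek at id, keyword, location, text, target
print(train_df.shape)    # roughly 7.6k training rows

# Keep only the columns the model actually needs.
train_df = train_df[["text", "target"]]
```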

Tokenization and Preprocessing

The first step is tokenization, which essentially turns sentences and words into tokens; tokens are what the machine uses to understand the context and process the input. BERT models generally come with their own tokenizer, which is built on an enormous vocabulary.

encode_plus returns a dictionary containing the encoded sequence (or sequence pair) along with additional information, mostly the masking information we will use in our model: the actual input ids, the attention masks, and the token type ids.
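Here is a hedged sketch of what this step can look like with the bert-base-uncased tokenizer from the transformers library. The max_length value and the example string are only illustrative; the notebook may use different settings.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer.encode_plus(
    "an example tweet goes here",   # illustrative input text
    add_special_tokens=True,        # adds [CLS] and [SEP]
    max_length=64,                  # illustrative value; see train_maxlen below
    padding="max_length",
    truncation=True,
    return_token_type_ids=True,
    return_attention_mask=True,
    return_tensors="pt",
)

ids = encoded["input_ids"]              # token ids
mask = encoded["attention_mask"]        # 1 for real tokens, 0 for padding
token_type_ids = encoded["token_type_ids"]
```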

The aim of this notebook is to determine how well a pre-trained model can perform without any text preprocessing, so we feed the model the data without any changes. However, if we take a look at the class division below, we can see that the data is slightly skewed: the number of texts with label 0 is somewhat higher than the number with label 1.

Division of Classes in the ‘target’ column

PyTorch offers a tool called the WeightedRandomSampler. It assigns a weight to each class that places more emphasis on the minority class, so that the end result is a classifier that learns equally from all classes. You can read more about it here.
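A rough sketch of how the sampler can be wired up follows. The weight computation below (inverse class frequency) is one common choice rather than the notebook's exact code, and train_df is the dataframe from the loading sketch above.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# train_df comes from the loading sketch above.
targets = torch.tensor(train_df["target"].values)

# One weight per class (inverse frequency), then one weight per sample.
class_counts = torch.bincount(targets)
class_weights = 1.0 / class_counts.float()
sample_weights = class_weights[targets]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)

# The sampler replaces shuffle=True when building the DataLoader, e.g.
# DataLoader(train_dataset, batch_size=batch_size, sampler=sampler),
# where train_dataset is a Dataset wrapping the tokenized tweets.
```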

Model

The pre-trained model that we will be using is bert-base-uncased. Hugging Face has made it really simple to use transformer/BERT models. Let us start by defining the parameters the model will require beforehand.

train_maxlen is the maximum sentence length we will use when fine-tuning the model on our dataset. batch_size is the number of examples processed in one training iteration. learning_rate determines the step size at each iteration while moving toward a minimum of the loss function.
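The exact values used in the notebook may differ; the ones below are illustrative placeholders for the parameters described above.

```python
import torch

train_maxlen = 64      # maximum tokenized length of a tweet (illustrative)
batch_size = 32        # examples per training iteration (illustrative)
learning_rate = 2e-5   # a typical fine-tuning learning rate for BERT (illustrative)
epochs = 10            # the article trains for 10 epochs
device = "cuda" if torch.cuda.is_available() else "cpu"
```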

Now let us make the structure of the model that we will be using for training.

The model is pretty straightforward: a fine-tuned BERT model followed by a dropout layer and, finally, a linear layer with 2 outputs, which is the shape we want our output to be.

The loss function that we will be using here is binary cross-entropy, since the output is binary.
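Below is a minimal sketch of such a model, assuming the standard BertModel from transformers, with BCEWithLogitsLoss applied to the two logits (targets are then expected as one-hot float vectors). The dropout value and names are illustrative, not the notebook's exact code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    """bert-base-uncased followed by dropout and a 2-unit linear head."""

    def __init__(self, dropout: float = 0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(self.bert.config.hidden_size, 2)  # two output logits

    def forward(self, ids, mask, token_type_ids):
        outputs = self.bert(
            input_ids=ids,
            attention_mask=mask,
            token_type_ids=token_type_ids,
        )
        pooled = outputs.pooler_output           # [CLS] representation
        return self.out(self.dropout(pooled))    # raw logits, shape (batch, 2)

# device comes from the parameters sketch above.
model = BertClassifier().to(device)

# Binary cross-entropy on the two logits; targets are one-hot encoded to match.
loss_fn = nn.BCEWithLogitsLoss()
```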

Train Function

Now we move on to the train function. Here we call model.train() before training; it is a PyTorch method that puts the layers in training mode. The inputs to the model are ids, mask, and token_type_ids, which are essentially the outputs of our tokenization function.

zero_grad() clears the gradients accumulated in the previous step, so that each optimization step starts fresh when we use gradient descent to decrease the error (or loss). We also record the loss for every batch to watch the trend and decide whether early stopping is required.

The optimizer we are using is AdamW, a variant of Adam with an improved implementation of weight decay.
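A hedged sketch of what the train function can look like, reusing the model, loss, and parameter names from the sketches above. The batch keys (ids, mask, token_type_ids, targets) assume a hypothetical Dataset that returns the tokenizer outputs together with one-hot float targets; the notebook's own helper may differ.

```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)

def train_one_epoch(model, loader, optimizer, loss_fn, device):
    model.train()                          # put layers in training mode
    running_loss = 0.0
    for step, batch in enumerate(loader):
        ids = batch["ids"].to(device)
        mask = batch["mask"].to(device)
        token_type_ids = batch["token_type_ids"].to(device)
        targets = batch["targets"].to(device)   # one-hot floats, shape (batch, 2)

        optimizer.zero_grad()              # clear gradients from the last step
        logits = model(ids, mask, token_type_ids)
        loss = loss_fn(logits, targets)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if step % 50 == 0:                 # log the batch loss to watch the trend
            print(f"batch {step}: loss {loss.item():.4f}")
    return running_loss / len(loader)
```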

Running the Model

Finally, we are done with all the functions and now we can run our model.

lr_scheduler.StepLR decays the learning rate of each parameter group by gamma every step_size epochs.

We save the model after 10 epochs and use it later for prediction on the test set.
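Putting it together, here is a rough sketch of the training loop with a StepLR scheduler and a final save. step_size, gamma, and the checkpoint filename are illustrative, and train_loader is assumed to be a DataLoader built with the weighted sampler above.

```python
import torch
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=3, gamma=0.1)   # illustrative values

for epoch in range(epochs):
    epoch_loss = train_one_epoch(model, train_loader, optimizer, loss_fn, device)
    scheduler.step()                                    # decay the learning rate
    print(f"epoch {epoch + 1}/{epochs}: mean loss {epoch_loss:.4f}")

# Save the fine-tuned weights after the final epoch for later prediction.
torch.save(model.state_dict(), "bert_disaster_tweets.pt")
```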

This is the output of the last training epoch; the running loss for each batch is shown.

Accuracy could be used as a metric as well; for that we would have to split the initial training dataset into train and dev sets and then use the dev set for evaluation (this is shown in the Colab notebook).

Predicting

Now that we have trained and saved our model, we can use it for prediction on our test set. Note that we still have to tokenize the test set before feeding it to the model.
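A minimal prediction sketch, reusing the tokenizer, model, and parameters defined above. Tweets are encoded one at a time here for clarity, though batching them would be faster.

```python
import torch

model.load_state_dict(torch.load("bert_disaster_tweets.pt"))
model.eval()

predictions = []
with torch.no_grad():
    for tweet in test_df["text"]:
        encoded = tokenizer.encode_plus(
            tweet,
            add_special_tokens=True,
            max_length=train_maxlen,
            padding="max_length",
            truncation=True,
            return_token_type_ids=True,
            return_attention_mask=True,
            return_tensors="pt",
        )
        logits = model(
            encoded["input_ids"].to(device),
            encoded["attention_mask"].to(device),
            encoded["token_type_ids"].to(device),
        )
        predictions.append(int(torch.argmax(logits, dim=1)))  # 0 or 1 per tweet
```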

This gives an accuracy of around 82% on the test set.

Conclusion

BERT has been a major upgrade to the field of language representation models and has given us state-of-the-art results on all the major NLP-related problems. It can process large amounts of text and language, provides an easy route to using pre-trained models (transfer learning), and can be fine-tuned to the specific language context and problem you face. In certain situations, BERT can even be applied directly to the data with no further training (in other words, zero-shot) and still deliver a high-performing model.

This article aims to provide a very basic approach to fine-tuning a BERT model by performing sentiment analysis. We feed our data into a pre-trained BERT model without any preprocessing, and that alone gives us around 82% accuracy, far better than its earlier counterparts (LSTMs or RNNs).

We can still experiment further with this model by adding text preprocessing, mainly URL and punctuation removal, spelling correction, lemmatization, and stop-word removal, or by using different BERT models from the Hugging Face website to see how the accuracy varies.
