BERT: Using Hugging Face for Sentiment Extraction with PyTorch

Akshay Subramanian
Published in Analytics Vidhya · 5 min read · Mar 28, 2020

In this post, I will walk you through “Sentiment Extraction” and what it takes to achieve excellent results on this task.

I will use the “Tweet Sentiment Extraction” dataset from Kaggle in this walk-through. For more details on the competition and the dataset, you can visit https://www.kaggle.com/c/tweet-sentiment-extraction/.

All the code I will use in this post is available at https://github.com/aksub99/bert-sentiment. Please star the repository if you find it useful so that it can reach a wider audience.

First of all, what exactly is the task?

We are given a tweet and its sentiment (positive, negative or neutral), and we are asked to predict the portion of the tweet that supports/signifies this sentiment.

We can see that our dataset contains the tweet in the “text” column and its sentiment in the “sentiment” column. The portion of the tweet that represents the sentiment is given in the “selected_text” column.

Our job is to create a model to predict “selected_text”, given “text” and “sentiment”.

As you might have observed, this task is extremely similar to Question Answering. The only difference is that the question has been replaced by the sentiment, the context/passage by the tweet and the answer by the portion of the tweet signifying the sentiment.

Now that we have understood the task and the dataset, let us get into the implementation details.

To start off, let’s check the dataset for NaN values and delete the corresponding rows.
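A minimal sketch of this step, assuming the competition’s train.csv has been downloaded to the working directory (the file path is my assumption):

```python
import pandas as pd

# Load the Kaggle training data (the path is an assumption).
df = pd.read_csv('train.csv')

# Drop rows that contain NaN values in any of the columns we need.
df = df.dropna(subset=['text', 'selected_text', 'sentiment']).reset_index(drop=True)
print(df.shape)
```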

We will use Hugging Face’s transformers library. So, let’s go ahead and install it.
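In a notebook, this is a single cell (the bang prefix assumes a Jupyter/Colab environment):

```python
!pip install transformers
```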

Different NLP algorithms require different types of tokenization of the input word sequence. The BERT architecture requires the input words to be WordPiece-tokenized, and ‘BertTokenizer’ does this for us.
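Loading it looks roughly like this; the choice of the lowercase ‘bert-base-uncased’ checkpoint is an assumption on my part:

```python
from transformers import BertTokenizer

# Fetch the WordPiece tokenizer that matches the pretrained BERT checkpoint.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
```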

These lines fetch the tokenizer required for our BERT model. It can be used later to convert our input sequences into the form required by BERT.

We now need to pad the input sentences such that they are all of the same length. To find out the amount of padding required for different sentences, we will first need to identify the length of the longest input sentence.
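A quick sketch of how that length can be found, assuming the dataframe and tokenizer defined above:

```python
# Tokenize every tweet and record the length of the longest one.
max_len = max(len(tokenizer.tokenize(tweet)) for tweet in df['text'])
print(max_len)
```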

We get a max length of 110 in the tweets. So, we will have to pad all tweets to achieve a common length of 110.

But wait… since our BERT’s input will be a concatenation of the “sentiment” and the “tweet”, the combined sequence will have to be padded to the maximum length of this combined sequence, not just that of the tweet.

We can safely assume that this combined sequence will have a maximum length of 150. So, we will pad our combined input sequences to achieve a common length of 150 for all sequences.

Additionally, our model needs some way of differentiating between the sentiment and the tweet since we are feeding a concatenation of these as input. Similarly, our model also needs to be able to differentiate between the tokens corresponding to the words and those corresponding to pad tokens.

To recap, we need to:

  1. Concatenate the sentiment and the tweet.
  2. Tokenize this combined sequence according to BERT’s requirements.
  3. Pad this combined sequence to a length of 150.
  4. Have some way of differentiating between word tokens and pad tokens and between sentiment and tweet.

All these steps can be performed with the help of a magical function known as encode_plus, which is a method available under the Tokenizer class.
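Here is a sketch of how encode_plus can cover all four steps; the exact padding argument depends on the transformers version (older releases use pad_to_max_length=True, newer ones use padding='max_length'):

```python
import torch

input_ids, attention_masks, token_type_ids = [], [], []

for sentiment, tweet in zip(df['sentiment'], df['text']):
    # encode_plus concatenates the two sequences, adds [CLS]/[SEP] tokens,
    # converts tokens to ids and pads everything to a length of 150.
    encoded = tokenizer.encode_plus(
        sentiment,                  # first sequence: the sentiment
        tweet,                      # second sequence: the tweet
        add_special_tokens=True,
        max_length=150,
        pad_to_max_length=True,     # in newer versions: padding='max_length'
        return_attention_mask=True,
        return_token_type_ids=True,
    )
    input_ids.append(encoded['input_ids'])
    attention_masks.append(encoded['attention_mask'])
    token_type_ids.append(encoded['token_type_ids'])

input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)
token_type_ids = torch.tensor(token_type_ids)
```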

We get 3 tensors above — “input_ids”, “attention_masks” and “token_type_ids”.

1) “input_ids” contains the sequence of token ids of the tokenized input sequence.

2) “attention_masks” contains a sequence of 0s and 1s. It helps the model differentiate between the actual input tokens and the pad tokens.

3) “token_type_ids” also contains a sequence of 0s and 1s. It helps the model differentiate between the first sequence (the sentiment) and the second (the tweet). The tokenizer also inserts a [SEP] token between the two sequences.

Let us print these tensors to get a better idea of what they contain.
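For instance, inspecting the first encoded example (variable names follow the sketch above):

```python
# Print the three encodings of the first training example.
print(input_ids[0])
print(attention_masks[0])
print(token_type_ids[0])
```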

Lastly, since “selected_text” is just a portion of the tweet, it can easily be represented as a start and end index. We need to feed these to our model as labels while training.

But, please note that this start and end index must be with reference to the combined sequence (sentiment + tweet) and not just the tweet. So, we now convert our “selected_text” column into start and end positions.

The code for extracting the start and end indices is given in https://github.com/aksub99/bert-sentiment. I will not display it here for the sake of brevity.

Let’s now split the dataset into training and validation sets and create PyTorch DataLoaders for these.
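A sketch of that step; the 90/10 split, the batch size of 32, and the tensor names start_positions and end_positions (LongTensors produced by the index-extraction code in the repository) are assumptions:

```python
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler,
                              SequentialSampler, random_split)

# Bundle the encoded inputs and the start/end labels into one dataset.
dataset = TensorDataset(input_ids, attention_masks, token_type_ids,
                        start_positions, end_positions)

# Hold out 10% of the data for validation (the split ratio is an assumption).
val_size = int(0.1 * len(dataset))
train_size = len(dataset) - val_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

batch_size = 32  # assumption
train_dataloader = DataLoader(train_dataset,
                              sampler=RandomSampler(train_dataset),
                              batch_size=batch_size)
val_dataloader = DataLoader(val_dataset,
                            sampler=SequentialSampler(val_dataset),
                            batch_size=batch_size)
```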

We’ll now create the model and load pretrained weights. We will use “BertForQuestionAnswering” because the task is very similar to answering questions and therefore requires the same architecture.
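In outline, again assuming the ‘bert-base-uncased’ checkpoint:

```python
import torch
from transformers import BertForQuestionAnswering

# BERT with a span-prediction head that outputs start and end logits.
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
```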

Finally, let’s set the optimizer and learning rate scheduler and train our model for 3 epochs.
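A condensed sketch of such a training loop, assuming the model and DataLoaders defined above; the learning rate, the absence of warmup steps, and the gradient-clipping value are assumptions rather than the exact settings used in the repository:

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

epochs = 3
optimizer = AdamW(model.parameters(), lr=3e-5)  # learning rate is an assumption
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=total_steps)

for epoch in range(epochs):
    # Training pass
    model.train()
    total_train_loss = 0.0
    for batch in train_dataloader:
        ids, masks, types, starts, ends = [t.to(device) for t in batch]
        model.zero_grad()
        outputs = model(ids,
                        attention_mask=masks,
                        token_type_ids=types,
                        start_positions=starts,
                        end_positions=ends)
        loss = outputs[0]  # span-prediction loss
        total_train_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    # Validation pass
    model.eval()
    total_val_loss = 0.0
    with torch.no_grad():
        for batch in val_dataloader:
            ids, masks, types, starts, ends = [t.to(device) for t in batch]
            outputs = model(ids,
                            attention_mask=masks,
                            token_type_ids=types,
                            start_positions=starts,
                            end_positions=ends)
            total_val_loss += outputs[0].item()

    print(f'Epoch {epoch + 1}: '
          f'train loss {total_train_loss / len(train_dataloader):.4f}, '
          f'val loss {total_val_loss / len(val_dataloader):.4f}')
```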

Looking at the training and validation losses, the model just starts to overfit after the third epoch. So, we will not proceed further.

There we go!

That’s how you train a Hugging Face BERT model for Sentiment Extraction / Question Answering.

Please let me know if you have any questions.
