Sentiment Analysis with Variable-Length Sequences in PyTorch

Himanshu
6 min read · Apr 15, 2018


The Motivation

Recently I participated in an NLP competition on Kaggle where I finished in the top 6%. As I am in the process of learning PyTorch, I thought there would be no better way to learn than to rewrite the code I had written in Keras for this competition.

This post focuses on how to implement sequence classification with variable-length sequences in pure PyTorch.

Our final aim is to build a simple GRU model with concat pooling [5] (Fig [6]). For this post I will use the Twitter Sentiment Analysis dataset [1], as it is much easier than the competition dataset. Download the dataset from [2].

The full code for this post is available here.

UPDATE — START — July’19

Added a PyTorch 1.0 compatible notebook. It uses PyTorch 1.1 and Ignite training functions, and makes better use of PyTorch's Dataset and DataLoader. The code is more compact and easier to understand.

Added a new repository that contains a REST API built in Flask to deploy ML models in production.

UPDATE — END

Prerequisites

  1. Basic knowledge of Pytorch
  2. Understanding of GRU/LSTM [4]

Simple Data Analysis

Data overview

First 5 rows of data

Let’s check the label distribution. The labels seem to be balanced.

Fig [1] Labels distribution
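
This check is a one-liner in pandas; the file name below is an assumption, while the label column Sentiment matches the data shown later:

import pandas as pd

df = pd.read_csv('Sentiment-Analysis-Dataset.csv')  # path/file name is an assumption
print(df['Sentiment'].value_counts())               # counts per label; roughly balanced here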

Start Simple

Let’s first build the vocab of all unique words in lowercase, then convert all the tweets to indexes and calculate the length of each tweet.
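
A minimal sketch of this step. The index column name sentimentidx matches the data view below; the text column name SentimentText and the whitespace tokenizer are assumptions, so the original code may differ:

from collections import Counter

# count every lowercased token across all tweets
counts = Counter(tok for text in df['SentimentText'] for tok in str(text).lower().split())

# index 0 is reserved for padding, so real words start at 1
vocab = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# replace each word with its index and record the tweet length
df['sentimentidx'] = df['SentimentText'].apply(
    lambda text: [vocab[tok] for tok in str(text).lower().split()])
df['lengths'] = df['sentimentidx'].apply(len)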

Check the lengths of the tweets after the tokenization.

Fig [2]

Data after tokenization and replacing words with indexes.

Fig [3] Data view after tokenization and replacing words with indexes

Time to create a simple PyTorch Dataset

PyTorch provides convenience classes for creating custom datasets and dataloaders. You can read about them here and here.
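
A minimal Dataset sketch over the dataframe built above (the class name and exact layout are mine, not necessarily the original code):

from torch.utils.data import Dataset

class TweetDataset(Dataset):
    """Indexing returns (token indexes, label) straight from the dataframe."""
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return self.df.sentimentidx[idx], self.df.Sentiment[idx]

ds = TweetDataset(df)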

ds[:4] --> (0             [14, 26, 132, 18, 10, 241549, 266, 6621]
            1                       [2, 272, 7, 90, 812, 1274, 16]
            2                           [247, 82, 217, 4573, 1012]
            3    [37, 241550, 4, 2, 73, 440, 6, 2, 73, 1454, 55...
            Name: sentimentidx, dtype: object,
            0    0
            1    0
            2    1
            3    0
            Name: Sentiment, dtype: int64)

Using the PyTorch DataLoader, we can load data from the custom dataset created above in batches.
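
For example, with a batch size of 3 (matching the output discussed next):

from torch.utils.data import DataLoader

dl = DataLoader(ds, batch_size=3)
x_batch, y_batch = next(iter(dl))  # first batch of token-index sequences and labels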

The Minor Details

  • If the data that comes out of the PyTorch Dataset is unpadded (i.e. samples are of different lengths), then the DataLoader returns a Python list instead of a PyTorch tensor, with the samples truncated to the length of the shortest sample in the batch.

In the code below, the output of the first batch, i.e. the first three samples, is truncated to length 5 (the shortest tweet length in the batch) and returned as a Python list.

Solution:

Pad the dataset and calculate the lengths of the tweets. In the code below you can see the output of the padded dataset and the dataloader. The samples are now of equal length and the output of the DataLoader is a LongTensor.

Note: I have taken the max length as 10, padded tweets shorter than 10 with zeros (to the right), and truncated longer ones. We will need the lengths of the tweets for the next step.
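
A sketch of the padding step, assuming the columns created earlier (the helper and column names are mine):

import numpy as np

MAX_LEN = 10  # chosen max length from the note above

def pad_data(seq):
    """Right-pad a list of indexes with zeros up to MAX_LEN, truncating longer tweets."""
    padded = np.zeros(MAX_LEN, dtype=np.int64)
    seq = seq[:MAX_LEN]
    padded[:len(seq)] = seq
    return padded

df['sentimentpadded'] = df['sentimentidx'].apply(pad_data)
df['lengths'] = df['lengths'].clip(upper=MAX_LEN)  # lengths are capped at MAX_LEN too

The Dataset can then return the padded column (and the length) instead of the raw index list, so each sample becomes (padded indexes, label, length).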

  • To feed variable-length sequences to the GRU we need to use pack_padded_sequence. To convert the output of the GRU back to a padded sequence, we use pad_packed_sequence. pack_padded_sequence requires the batch to be sorted by length. Read about it here.
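
A minimal illustration of the round trip (the sizes below are arbitrary, just for illustration):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# padded batch of embeddings: (seq_len, batch, emb_size), sorted by decreasing length
padded = torch.randn(10, 3, 8)
lengths = [10, 7, 5]

packed = pack_padded_sequence(padded, lengths)            # what we feed to the GRU
out_packed, hidden = nn.GRU(8, 16)(packed)                # GRU runs only over real timesteps
gru_out, out_lengths = pad_packed_sequence(out_packed)    # back to (seq_len, batch, hidden)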

Our simple GRU model

Fig [4] Simple GRU model by taking output of last hidden state
  • Also, PyTorch RNNs take the batch in the shape sequence_length x batch_size x input_size (the embedding size here), so we have to transpose the batch after sorting. Refer to the sort_batch function in the code below.
  • The output of the GRU/LSTM contains the padded hidden states of all timesteps. If we take gru_out[-1] (i.e. the last timestep of the GRU output), the result will contain zeros for the samples that are shorter than the longest sample in the batch, Fig [5]. The good thing is that the last hidden state contains the last non-zero output of the GRU for every sample.
Fig [5] GRU output and last hidden state
outp:
Variable containing:
-0.9536 -0.4867
-0.8185 -0.5818
-0.8788 -0.5366
[torch.FloatTensor of size 3x2]
torch.max(outp, dim=1):(Variable containing:
-0.4867
-0.5818
-0.5366
[torch.FloatTensor of size 3], Variable containing:
1
1
1
[torch.LongTensor of size 3])

Now that we have the last output, we can train our model. Refer to the training code here.
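
For reference, a minimal sketch of such a model and the sort_batch helper; the layer sizes, names, and embedding dimension are my assumptions, not the original code:

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

def sort_batch(x, y, lengths):
    """Sort a (batch, seq_len) batch by decreasing length and transpose to (seq_len, batch)."""
    lengths, idx = lengths.sort(descending=True)
    return x[idx].transpose(0, 1), y[idx], lengths

class SimpleGRU(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hidden_dim=100, n_out=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)  # 0 is the pad index
        self.gru = nn.GRU(emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_out)

    def forward(self, seq, lengths):
        emb = self.emb(seq)                              # (seq_len, batch, emb_dim)
        packed = pack_padded_sequence(emb, lengths.cpu())
        _, hidden = self.gru(packed)                     # hidden: (1, batch, hidden_dim)
        return self.out(hidden[-1])                      # classify from the last hidden state

# usage: x, y, lens = sort_batch(x_batch, y_batch, lengths_batch); logits = model(x, lens)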

Our final objective is to build a GRU model with concat pooling [5].

Fig [6] GRU with concat pooling [5]

Concat pooling, in simple terms, means taking the max and the average of the outputs of all timesteps and then concatenating them, along with the last hidden state, before passing the result to the output layer. Refer to the Concat Pooling section in the paper [5].
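
In code, the pooling step would look roughly like this (a sketch only; the bullets below explain why this naive version is slightly off on padded batches):

import torch
import torch.nn.functional as F

def concat_pool(gru_out, hidden):
    """gru_out: padded GRU output (seq_len, batch, hidden_dim); hidden: (1, batch, hidden_dim)."""
    gru_out = gru_out.permute(1, 2, 0)                         # (batch, hidden_dim, seq_len) for pooling
    avg_pool = F.adaptive_avg_pool1d(gru_out, 1).squeeze(-1)   # (batch, hidden_dim)
    max_pool = F.adaptive_max_pool1d(gru_out, 1).squeeze(-1)   # (batch, hidden_dim)
    return torch.cat([hidden[-1], max_pool, avg_pool], dim=1)  # (batch, 3 * hidden_dim)

The concatenated vector then goes to a linear output layer with input size 3 * hidden_dim.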

  • We can pass the output of the GRU to PyTorch's adaptive max pooling and adaptive average pooling functions. But there is a problem with this method.
  • Since the GRU output is padded to the length of the longest sample in the batch (Fig [5]), the average taken by F.adaptive_avg_pool1d() may be lower than the actual value, because the zero padding is also counted.
  • Similarly, the max taken by F.adaptive_max_pool1d() may be higher than the actual value: if a hidden state contains only negative values, then with zero padding, zero is taken as the max instead of the largest negative value (refer to the output below).
  • I have calculated the actual max and avg pooling by hand (refer to lines 25 and 33 of the code below). I got a slight increase in accuracy compared to PyTorch's adaptive max/avg pooling, but my approach is quite slow. I don't know if there exists an elegant solution to this.
Adaptive avg pooling Variable containing:
-0.1657 0.3532 -0.2512 -0.0778 -0.5564
0.2315 0.1153 -0.1446 0.3549 -0.0534
0.1212 0.1945 -0.1012 0.2525 -0.1593
[torch.FloatTensor of size 3x5]

By hand Adaptive avg pooling Variable containing:
-0.1657 0.3532 -0.2512 -0.0778 -0.5564
0.2646 0.1318 -0.1653 0.4056 -0.0610
0.1939 0.3112 -0.1619 0.4040 -0.2549
[torch.FloatTensor of size 3x5]

Adaptive max pooling Variable containing:
0.4062 0.5081 0.1206 0.2272 -0.0184
0.5597 0.4545 0.2079 0.5897 0.4978
0.3965 0.5072 0.1130 0.5585 0.0000
[torch.FloatTensor of size 3x5]

By hand Adaptive max pooling Variable containing:
0.4062 0.5081 0.1206 0.2272 -0.0184
0.5597 0.4545 0.2079 0.5897 0.4978
0.3965 0.5072 0.1130 0.5585 -0.0092
[torch.FloatTensor of size 3x5]

You can compare the output of F.adaptive_max_pool1d() and the max pooling by hand (custom code) in the output above. The max pooling by hand contains a negative value as the max, whereas F.adaptive_max_pool1d() returns zero.
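
One way to get the "by hand" numbers is to mask the padded timesteps using the tweet lengths. This is my own sketch of the idea, not the original code:

import torch

def masked_pool(gru_out, lengths):
    """gru_out: (seq_len, batch, hidden_dim) padded GRU output; lengths: (batch,) true lengths."""
    seq_len = gru_out.size(0)
    # mask[t, b, 0] is 1 for real timesteps and 0 for padding
    mask = (torch.arange(seq_len, device=gru_out.device).unsqueeze(1)
            < lengths.unsqueeze(0)).unsqueeze(-1)
    # average only over the real timesteps of each sample
    avg_pool = (gru_out * mask.float()).sum(dim=0) / lengths.unsqueeze(1).float()
    # set padded positions to -inf so they can never win the max
    max_pool = gru_out.masked_fill(~mask, float('-inf')).max(dim=0)[0]
    return avg_pool, max_pool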

Ways to improve accuracy

The accuracy I got on the training set with the above setup is ~81%. There are many ways to improve it.

  • Include a validation set for a better accuracy measure
  • Better preprocessing
  • More hidden units
  • Pretrained word/char embeddings, e.g. GloVe, fastText, word2vec
  • Regularization e.g. Dropout
  • Better model e.g. Bidirectional GRU, GRU with attention

In the next post I will cover torchtext (PyTorch Text) and how it can solve some of the problems we faced, with much less code. I will also include the above-mentioned tips for improving accuracy.

This is my first blog post. Please correct me if you find any mistakes or if I have left out any references.

References

[1] http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
[2] http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
[3] https://stackoverflow.com/questions/46387661/how-to-correctly-implement-a-batch-input-lstm-network-in-pytorch
[4] https://colah.github.io/posts/2015-08-Understanding-LSTMs/
[5] https://arxiv.org/abs/1801.06146

