Sentiment Analysis — TorchText

Apr 28, 2018 · 6 min read

This post is the second part of the series. In the first part I built a sentiment analysis model in pure PyTorch. In this post I tackle the same task with torchtext, demonstrate where torchtext shines, and show how much code it saves.

Sentiment analysis is a classification task where each sample is assigned a positive or negative label. You can follow along with the code here.

Download the dataset from [2].

Typical components of a classification task in NLP

  1. Generating vocabulary of unique tokens and converting words to indices
  2. Loading pretrained vectors, e.g. GloVe, word2vec, fastText
  3. Padding variable-length text with zeros
  4. Data loading and batching
  5. Model creation and training
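To make components 1 and 3 concrete, here is a minimal pure-Python sketch of vocabulary building, numericalization, and zero-padding; torchtext automates all of this, and the function names and toy tweets below are my own illustrations:

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size=100000):
    """Map tokens to indices by frequency; index 0 = <pad>, 1 = <unk>."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = ["<pad>", "<unk>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize_and_pad(tokenized_texts, stoi, max_len):
    """Convert tokens to indices and pad every sequence to max_len with zeros."""
    batch = []
    for text in tokenized_texts:
        ids = [stoi.get(tok, stoi["<unk>"]) for tok in text]
        batch.append(ids + [stoi["<pad>"]] * (max_len - len(ids)))
    return batch

texts = [["good", "movie"], ["bad", "bad", "movie"]]
stoi, itos = build_vocab(texts)
padded = numericalize_and_pad(texts, stoi, max_len=3)
```

The rest of the post shows how torchtext collapses these steps into a few declarative lines.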

Why use torchtext

I have split the data into train and validation sets and saved them as CSV files.

Note: Make sure to remove all ‘\n’ characters before saving the CSV, as torchtext has trouble handling the ‘\n’ character.
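For example, the newlines can be scrubbed with the standard csv module before writing the file; the sample rows below are made up for illustration, using the same four-column layout as the tweet dataset:

```python
import csv

# toy rows in the (ItemID, Sentiment, SentimentSource, SentimentText) layout
rows = [
    (1, 0, "Sentiment140", "is so sad for my APL friend\n..."),
    (2, 0, "Sentiment140", "I missed the New Moon trailer\n"),
]

# strip embedded newlines before writing, since torchtext mishandles them
cleaned = [(i, lab, src, txt.replace("\n", " ").strip())
           for i, lab, src, txt in rows]

with open("traindf.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ItemID", "Sentiment", "SentimentSource", "SentimentText"])
    writer.writerows(cleaned)
```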

1. Define how to process data


The first step is to declare what attributes (columns) in the dataframe we want to use and how to process them. The dataframe consists of 4 columns (‘ItemID’, ‘Sentiment’, ‘SentimentSource’, ‘SentimentText’) and we want to use only ‘Sentiment’ and ‘SentimentText’.

The label column (‘Sentiment’) is binary and already in numerical form, so there is no need to process it. The tweet column (‘SentimentText’) needs processing and tokenization, so that it can be converted into indices.

Let’s break down the above code. In torchtext, a column is called a field. The Field object takes arguments that specify how to process (e.g. tokenize) the text. This Field object will later be attached to the dataset.

Line 10 defines the blueprint of how a column or field will be handled when we pass the actual data (tweets) in the future.

For the text column or field, the following parameters are used:

  - 'sequential=True': tells torchtext that the data is a sequence and not discrete.
  - 'tokenize': takes a function that will tokenize a given text, in this case a single tweet. You can also pass the string 'spacy' here if spaCy is installed.
  - 'include_lengths=True': besides the tokenized text, we will also need the lengths of the tweets for the RNN.
  - 'use_vocab=True': since this field holds the text data, we need a vocabulary of unique words; this tells torchtext to build it.

Line 15 defines the blueprint of how a column or field will be handled when we pass the actual data (labels) in the future.

For the label column or field, the following parameters are used:

  - 'sequential=False': labels are not sequential data, they are discrete, so this attribute is False.
  - 'use_vocab=False': since this is a binary classification problem and the labels are already numericalized, this is set to False.
  - 'pad_token=None, unk_token=None': we don't need padding or out-of-vocabulary tokens for labels.

In line 20 we define how each column will be processed. Columns mapped to None will be ignored and not loaded.
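Putting this together, the field definitions likely look something like the following sketch, written against the classic torchtext API (torchtext.data before v0.9, torchtext.legacy.data afterwards); the whitespace tokenizer is a stand-in assumption:

```python
from torchtext.data import Field  # torchtext.legacy.data in torchtext >= 0.9

def tokenizer(text):
    # stand-in tokenizer; the string 'spacy' could be passed instead
    return text.split()

# blueprint for the tweet text
txt_field = Field(sequential=True,       # the data is a sequence of tokens
                  tokenize=tokenizer,    # how to split a tweet into tokens
                  include_lengths=True,  # also return tweet lengths for the RNN
                  use_vocab=True)        # build a vocabulary for this field

# blueprint for the label
label_field = Field(sequential=False,  # labels are discrete
                    use_vocab=False,   # already numericalized
                    pad_token=None,    # no padding token needed
                    unk_token=None)    # no out-of-vocabulary token needed

# map each csv column to a field; columns mapped to None are ignored
train_val_fields = [
    ("ItemID", None),
    ("Sentiment", label_field),
    ("SentimentSource", None),
    ("SentimentText", txt_field),
]
```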

2. Create torchtext dataset

So far we have defined the blueprints for processing. Now we will actually load the data for processing.

Line 27 uses the TabularDataset.splits() method, which processes multiple files (train, validation, test) with the same processing in one go.

  - 'path': the directory where the csv or tsv files are stored.
  - 'format': the format of the files to be loaded and processed.
  - 'train': name of the train file. The final path becomes ./data/traindf.csv.
  - 'validation': name of the validation file. The final path becomes ./data/valdf.csv.
  - 'fields': tells torchtext how the incoming columns will be processed.
  - 'skip_header=True': skip the first line of the csv, since it contains the header.

TabularDataset.splits() will return train dataset and validation dataset.
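A sketch of that call, assuming the train_val_fields list from the previous step and the same classic torchtext API:

```python
from torchtext.data import TabularDataset  # torchtext.legacy.data in torchtext >= 0.9

train_ds, val_ds = TabularDataset.splits(
    path="./data",            # where the csv files are stored
    format="csv",             # format of the files
    train="traindf.csv",      # final path: ./data/traindf.csv
    validation="valdf.csv",   # final path: ./data/valdf.csv
    fields=train_val_fields,  # how each column is processed
    skip_header=True,         # skip the header line of the csv
)

# each dataset is a list-like collection of Example objects:
# ex = train_ds[0]; ex.SentimentText, ex.Sentiment
```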

Let’s look at what this TabularDataset object contains.

A TabularDataset is essentially a list of Example objects. An Example wraps all the columns (text and label) of a single sample, and these can be accessed by the column names used in the code above.

3. Loading pretrained word vectors and building the vocabulary

If you have already downloaded the pretrained vectors, you can specify their path and torchtext will read and load them. I have used downloaded vectors in the code below.

In lines 4 and 6 above, torchtext builds the vocabulary from the text in the “SentimentText” column. The vocabulary is built on the train and validation datasets, with the maximum number of unique words capped at 100,000. Words that are not in the vocabulary will be assigned the <unk> token. The pretrained vectors are passed in while the vocabulary is built.

Now when you execute line 4, torchtext creates a dictionary of all unique words, arranges them in decreasing order of frequency, and adds the <unk> and <pad> tokens at the beginning of this dictionary. Torchtext then assigns a unique integer to each word and keeps the mapping in txt_field.vocab.stoi (string to index) and the reverse mapping in txt_field.vocab.itos (index to string).
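A sketch of this vocabulary-building step with downloaded vectors; the GloVe file name and cache directory here are assumptions:

```python
from torchtext.vocab import Vectors

# point torchtext at locally downloaded vectors (file name is an assumption)
vec = Vectors("glove.twitter.27B.100d.txt", cache="./vectors")

# build the vocabulary over train and validation text, attaching the vectors
txt_field.build_vocab(train_ds, val_ds, max_size=100000, vectors=vec)

# txt_field.vocab.stoi maps word -> index, txt_field.vocab.itos maps index -> word
```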

4. Loading the data in batches

Note: BucketIterator returns a Batch object instead of text indices and labels, and the Batch object is not iterable like a PyTorch DataLoader. A single Batch object contains the data of one batch, and its text and labels can be accessed via the column names.
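A sketch of the BucketIterator setup being described, again against the classic torchtext API; the batch sizes and device are assumptions:

```python
import torch
from torchtext.data import BucketIterator  # torchtext.legacy.data in torchtext >= 0.9

train_iter, val_iter = BucketIterator.splits(
    (train_ds, val_ds),
    batch_sizes=(64, 64),
    sort_key=lambda x: len(x.SentimentText),  # bucket tweets of similar length
    sort_within_batch=True,                   # needed because include_lengths=True
    device=torch.device("cpu"),
    repeat=False,
)

# each iteration yields a Batch object:
# batch.SentimentText -> (token id tensor, lengths tensor)
# batch.Sentiment     -> label tensor
```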

This is one of the small hiccups in torchtext, but it can easily be overcome in two ways: either write some extra code in the training loop to get the data out of the Batch object, or write an iterable wrapper around the Batch object that returns the desired data. I will take the second approach, as it is much cleaner.
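Such a wrapper can be as simple as the following sketch; the class name and use of getattr are my own choices, and the fake one-batch iterator only demonstrates the behaviour without needing torchtext:

```python
from types import SimpleNamespace

class BatchWrapper:
    """Iterable wrapper that pulls named attributes out of each Batch object."""

    def __init__(self, batch_iter, x_var, y_var):
        self.batch_iter = batch_iter  # e.g. a BucketIterator
        self.x_var = x_var            # text column name, e.g. "SentimentText"
        self.y_var = y_var            # label column name, e.g. "Sentiment"

    def __iter__(self):
        for batch in self.batch_iter:
            # Batch stores each column as an attribute named after its field
            yield getattr(batch, self.x_var), getattr(batch, self.y_var)

    def __len__(self):
        return len(self.batch_iter)

# demo with a fake one-batch iterator standing in for a BucketIterator
fake_iter = [SimpleNamespace(SentimentText=([[4, 2, 0]], [2]), Sentiment=[1])]
train_dl = BatchWrapper(fake_iter, "SentimentText", "Sentiment")
for (ids, lengths), labels in train_dl:
    pass  # training step would go here
```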

With the code above, the iterator can be used in the training loop just like a PyTorch DataLoader.

5. Finally, model and training


Other classes in torchtext

This wraps up the short discussion of torchtext for the sentiment analysis task. It was an overview of torchtext. In the next post I will discuss implementing attention for sentence classification.



