Let’s Get Sentimental (with Pytorch)

Adam Wearne
May 20

Overview

Hello and welcome back to the nail-biting continuation of this series on Pytorch and NLP. In the last post we saw the basics of how to build a Pytorch model and how to train it. In that post, we identified a few pain points regarding model construction and training. In this post, we’ll look at a few 3rd party libraries that we can use alongside Pytorch to make our lives a little easier when it comes to training, model check-pointing, and evaluation.

In particular, the libraries we’ll be using today include tensorboardX and torchtext, along with a little help from spaCy. The task we’ll be solving today is a classic one in NLP: sentiment analysis. Let’s get started!


Preprocessing the Data

The dataset we’ll be looking at here is the famous (infamous?) IMDB movie review dataset. As usual, the first step is to curate the data in such a way that makes it easy for us to iterate over and feed into our model. Previously, we handled this by creating a dictionary that mapped each unique word in our vocabulary to an integer, and then doing our computations on the resulting sequences of integer IDs. Wouldn’t it be nice if there was another library that could do this for us? Ask and ye shall receive.
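For reference, the manual bookkeeping from the last post looked roughly like the sketch below. This is not the exact code from that post, and tokenized_reviews is just a stand-in for a list of already-tokenized reviews.

from collections import Counter

# Rough sketch of the old, manual approach. tokenized_reviews is a stand-in
# for a list of already-tokenized reviews (not a variable defined in this post).
counts = Counter(token for review in tokenized_reviews for token in review)

# Reserve IDs 0 and 1 for padding and unknown tokens.
word2idx = {word: i + 2 for i, (word, _) in enumerate(counts.most_common())}
word2idx['<pad>'], word2idx['<unk>'] = 0, 1

# Encode one review as a sequence of integer IDs.
encoded = [word2idx.get(token, word2idx['<unk>']) for token in tokenized_reviews[0]]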

TorchText is a package that contains tons of preprocessing utilities and datasets that are really common in NLP tasks. One of the most central concepts in how TorchText handles data is the Field class. A Field defines how we want our input/output data to be handled: the type of data (raw text, lists of characters, labels, etc.), how we’d like to tokenize it, and other common preprocessing steps. We’ll start with some of the standard imports.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchtext import data
from torchtext import datasets
from tensorboardX import SummaryWriter

import random

The torch imports should look familiar to you, and we’ll touch on the Tensorboard stuff later. For now, let’s focus our efforts on working with TorchText. The first step is to define the Field(s) that we would like to use. We’ll need one Field for the actual text strings themselves (along with some additional keyword arguments), as well as a Field for the sentiment labels.

SEED = 407

TEXT = data.Field(tokenize='spacy', lower=True, include_lengths=True)
LABEL = data.LabelField(dtype=torch.float)

We’ll first set a random seed to ensure reproducibility of the results. TEXT will be the field corresponding to the actual string data. By default, TorchText tokenizes inputs based on whitespace, but here we’ll use spaCy to tokenize things in a more sophisticated manner. As a preprocessing step we’ll convert everything to lowercase (this is not the default behavior) and include the actual lengths of the tokenized strings. This will come in handy later when organizing our training data into mini-batches.
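To see what “more sophisticated” means here, compare spaCy’s tokenizer with a plain whitespace split. This is just a quick illustration and assumes the spaCy English model has been downloaded; the exact output may vary by spaCy version.

import spacy

# spaCy splits off punctuation and contractions; a whitespace split does not.
nlp = spacy.load('en')
print([token.text for token in nlp.tokenizer("This movie wasn't great.")])
# Something like: ['This', 'movie', 'was', "n't", 'great', '.']

print("This movie wasn't great.".split())
# ['This', 'movie', "wasn't", 'great.']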

The LABEL field is a special type of Field which is designed to hold labels! Whoa, what a shocker. The only thing we need to be careful of here is that (as of the time of writing) the labels get cast to torch.LongTensor by default, which doesn’t match the data type expected by Pytorch’s loss functions. The easy fix is to just specify the dtype.

A nice feature of TorchText is that it includes the IMDB dataset, so we don’t need to concern ourselves with downloading and setting up the appropriate file paths. Moreover, it provides some really easy ways to split our data up into training, validation, and test sets.

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
train_data, valid_data = train_data.split(random_state = random.seed(SEED))
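If you want to sanity-check the splits, the resulting Dataset objects behave like lists of Example objects whose attributes match the field names. A quick inspection might look like this (the exact counts and tokens you see will depend on the random split):

# How big is each split?
print(f'Training examples:   {len(train_data)}')
print(f'Validation examples: {len(valid_data)}')
print(f'Test examples:       {len(test_data)}')

# Peek at one example; the field names ('text', 'label') become attributes.
example = train_data.examples[0]
print(example.label, example.text[:10])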

Okay, so we have our Field objects instantiated and the data loaded; now we need to build our vocabulary. In the last post, this was done by creating a dictionary and mapping each word to some integer ID. TorchText not only automates this process for us, but also provides some additional goodies.

MAX_VOCAB_SIZE = 10000

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors='glove.6B.300d',
                 unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)

To build the vocabulary in an unbiased way, we need to do so on the training data only. Here, we’ll take the top 10,000 most frequently occurring words as our total vocabulary. Beyond that, we can leverage pre-trained word embeddings rather than training them from scratch. Here, we’ll use a 300-dimensional GloVe embedding. By default, any words that appear in our vocabulary but do not appear in the chosen embedding get initialized to a vector of all zeroes. This could slow down the speed at which we converge to a decent solution. To avoid this, we’ll use the unk_init argument to initialize any such words to a vector with entries drawn from a normal distribution.
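At this point it can be reassuring to poke around the vocabulary object that TorchText built for us. The snippet below is just for inspection; the exact sizes and indices shown in the comments are illustrative.

# The vocabulary holds the usual lookup tables plus the loaded GloVe vectors.
print(len(TEXT.vocab))            # MAX_VOCAB_SIZE plus the <unk> and <pad> tokens
print(TEXT.vocab.itos[:10])       # integer ID -> token
print(TEXT.vocab.stoi['movie'])   # token -> integer ID
print(TEXT.vocab.vectors.shape)   # e.g. torch.Size([10002, 300])
print(LABEL.vocab.stoi)           # e.g. {'neg': 0, 'pos': 1}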

The last thing we’ll need to do before setting up our model is to create a set of iterators. We’ll define a size for our mini-batches, and as before, we’ll define a device variable. TorchText’s BucketIterator will attempt to batch together samples with similar lengths to minimize the amount of padding needed.

BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    device=device)
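Because we set include_lengths=True on the TEXT field, each batch’s text attribute comes back as a pair of tensors: the padded token IDs and the true lengths. A quick look at one batch (the shapes in the comments are examples, not guaranteed values):

# Pull a single batch to see what the iterator produces.
batch = next(iter(train_iterator))
text, text_lengths = batch.text
print(text.shape)          # [max sequence length in batch, batch size], e.g. [132, 64]
print(text_lengths[:5])    # true (unpadded) lengths for the first few examples
print(batch.label.shape)   # [batch size]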

Building the Model

For the actual model itself, we’ll be using a bidirectional LSTM. There’s nothing too terribly out of the ordinary happening here as this is a pretty standard architecture for sentiment analysis tasks.

class AdamNetV2(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, n_layers,
                 is_bidirectional=True, dropout=0.0, output_dim=1, padding_idx=None):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim,
                                      padding_idx=padding_idx)

        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            num_layers=n_layers, bidirectional=is_bidirectional,
                            dropout=dropout)

        # A bidirectional LSTM produces two final hidden states, so the
        # fully-connected layer's input is twice as wide in that case.
        self.fc = nn.Linear((is_bidirectional + 1) * hidden_dim, output_dim)

        self.is_bidirectional = is_bidirectional

    def forward(self, input_sequence, sequence_length):

        embeddings = self.embedding(input_sequence)

        # Pack the padded batch so the LSTM skips over the padding tokens
        packed_embeddings = nn.utils.rnn.pack_padded_sequence(embeddings,
                                                              sequence_length)

        packed_output, (hidden_state, cell_state) = self.lstm(packed_embeddings)

        if self.is_bidirectional:
            # Concatenate the final forward and backward hidden states
            output = torch.cat((hidden_state[-2, :, :], hidden_state[-1, :, :]), dim=1)
        else:
            output = hidden_state[-1, :, :]

        scores = self.fc(output)

        return scores

So this model first uses the embedding layer as a look-up table to convert our sequence of integer IDs into a sequence of word vectors. The iterator has already padded each sequence in the batch to the same length, so here we pack the padded batch (which lets the LSTM skip over the padding) and pass it through the recurrent portion of our model. Here, we’re using a bidirectional LSTM with three layers. After each sequence is processed, we take the LSTM’s final hidden states (corresponding to the forward and reversed directions) and concatenate them. This concatenated vector is then passed to a final fully-connected layer to make the sentiment prediction.

vocab_size = len(TEXT.vocab)
embedding_dim = 300 # This needs to match the size of the pre-trained embeddings!
hidden_dim = 256
num_layers = 3
dropout = 0.5
pad_idx = TEXT.vocab.stoi[TEXT.pad_token]
model = AdamNetV2(vocab_size=vocab_size, embedding_dim=embedding_dim,
                  hidden_dim=hidden_dim, n_layers=num_layers, dropout=dropout,
                  padding_idx=pad_idx)

With the model defined, let’s then initialize the embedding layer of our network with the GloVe vectors mentioned earlier. To do this, we just need to copy the vectors over into the weights of our model’s embedding layer. Recall that we had also previously initialized some words that did not appear in GloVe to random vectors. This is good for most words, but in particular, we’d like our model to ignore padding and unknown tokens as much as possible. To do this, we’ll just set those vectors to zeros explicitly.

Finally, we’ll define our loss function (binary cross-entropy), our optimizer, and move everything over to GPU.

# Initialize word embeddings
glove_vectors = TEXT.vocab.vectors
model.embedding.weight.data.copy_(glove_vectors)
# Zero out <unk> and <pad> tokens
unk_idx = TEXT.vocab.stoi[TEXT.unk_token]
model.embedding.weight.data[unk_idx] = torch.zeros(embedding_dim)
model.embedding.weight.data[pad_idx] = torch.zeros(embedding_dim)
# Define our loss function, optimizer, and move things to GPU
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)
optimizer = optim.Adam(model.parameters())

For training, let’s define a few helper functions to process each batch. We’ll have one for calculating the accuracy across a batch, one for actually training and updating the model, and a final helper function for evaluation on the validation set. These functions just act as wrappers for updating and evaluating the model as we discussed in the previous post.

def accuracy(scores, y):
    # Round the sigmoid outputs to get hard 0/1 predictions
    scores = torch.round(torch.sigmoid(scores))
    correct = (scores == y).float()
    acc = correct.sum() / len(correct)
    return acc


def train(model, iterator, optimizer, criterion):

    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in iterator:

        optimizer.zero_grad()

        text, text_lengths = batch.text

        predictions = model(text, text_lengths).squeeze(1)

        loss = criterion(predictions, batch.label)
        acc = accuracy(predictions, batch.label)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)


def evaluate(model, iterator, criterion):

    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():

        for batch in iterator:

            text, text_lengths = batch.text

            predictions = model(text, text_lengths).squeeze(1)

            loss = criterion(predictions, batch.label)
            acc = accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Training the Model

Now we can actually begin training the model. But let’s add something extra to make this time a little more interesting. We can use tensorboardX to log key metrics throughout the training process and to create visualizations of our word embeddings. Tensorboard is more tightly integrated with Tensorflow/Keras, but tensorboardX allows us to do similar things with Pytorch.

We first need to create a SummaryWriter object, and then simply tell it to add whatever metrics we’d like to record! Here, we’ll save the loss and accuracy per epoch, and then at the very end, we’ll save our word embeddings so that we can visualize them.

summary_writer = SummaryWriter(log_dir="tf_log/")

num_epochs = 10
best_valid_loss = 1000000

for epoch in range(num_epochs):

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    # Log the training results
    summary_writer.add_scalar("training/accuracy", train_acc, epoch)
    summary_writer.add_scalar("training/loss", train_loss, epoch)

    # Log the validation results
    summary_writer.add_scalar("validation/accuracy", valid_acc, epoch)
    summary_writer.add_scalar("validation/loss", valid_loss, epoch)


# After completing all epochs, visualize our word vectors
vecs = model.embedding.weight.data
labels = [l.encode('utf8') for l in TEXT.vocab.itos]
summary_writer.add_embedding(vecs, metadata=labels)
summary_writer.close()

# Print test performance
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f}\nTest Acc: {test_acc*100:.2f}%')

I ended up getting a test accuracy of around 88%. Not bad! Now let’s take a look at our results with Tensorboard. To start it up, run the following command in a terminal: tensorboard --logdir=<wherever your log files are> (in our case, the tf_log/ directory we passed to the SummaryWriter). Doing so should start up the Tensorboard server.

Looks like we started to overfit at the fourth epoch

To look at the embeddings, check out the Projector tab! Here we can see a nice interactive visualization of our final word embeddings via PCA or t-SNE.

TorchText and TensorboardX are both really helpful libraries for building more robust models with Pytorch. We’ve only scratched the surface here, so keep an eye out for the next post where we’ll build a chat bot!

Thanks to Beatriz Miranda
