Intro to Pytorch with NLP

Adam Wearne
Apr 16, 2019 · 6 min read


Overview

When it comes to options for deep learning within the Python ecosystem, there are TONS of choices. Keras is a great choice for starting out and for quickly developing and iterating on models; pure Tensorflow is amazingly fast and, with the recent advent of Tensorflow 2.0, will only become more awesome. However, over the past few years, there has been a huge surge in popularity for Pytorch. In this first post, I’d like to introduce some of the main Pytorch concepts, and apply them to a common task in natural language processing: Named Entity Recognition (NER).

In case you’re unfamiliar with the task, NER is a supervised learning task which attempts to classify each word in a sentence as belonging to some set of categories. Common categories that many NER systems are interested in identifying are things like “organizations”, “locations”, “names”, “dates”, etc.

So the plan for us is to use this toy task as a means for learning about Pytorch. Let’s do it!

Loading the data

All right, before we get to building the model, we need some data! There are actually some pretty neat ways of iterating over custom datasets in Pytorch, but for the time being we’ll keep things simple. The dataset we’ll be using can be found here.

Because the CSV is formatted in a very particular way, we’ll need to do some preprocessing of it to get it into a form that’s easier to handle. The following snippet will read the CSV, and create a list of tuples. Each tuple will contain the tokenized sentence and the list of NER tags corresponding to that sentence.
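Here’s a minimal sketch of that preprocessing. It assumes the CSV follows the layout of the widely used Kaggle NER corpus — a “Sentence #” column that’s only filled in on the first word of each sentence, plus “Word” and “Tag” columns — so treat the file name and column names as assumptions and adjust them to match your copy of the data.

```python
import pandas as pd

# Assumption: the CSV has "Sentence #", "Word", and "Tag" columns, with the
# sentence ID only present on the first word of each sentence.
df = pd.read_csv('ner_dataset.csv', encoding='latin1')
df = df.ffill()  # forward-fill the sentence IDs so every row is labeled

# Build a list of (tokenized sentence, list of NER tags) tuples
training_data = []
for _, group in df.groupby('Sentence #', sort=False):
    words = group['Word'].tolist()
    tags = group['Tag'].tolist()
    training_data.append((words, tags))

# Grab all of the unique NER tags so we can map each one to a class index later
unique_tags = sorted(set(df['Tag'].tolist()))
```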

At the end of this snippet, I’ve also included a couple lines which grab all of the unique NER tags that appear. The purpose of this is to be able to assign each tag to some categorical class later on. Now let’s create our model!

Creating the model class

One of the notable differences between Pytorch and other deep learning libraries is that to create a neural network, we need to define a class for our model.

The constructor of our model’s class will first make a call to the constructor of the parent class, torch.nn.Module. After this, we are free to define whatever other instance variables we’d like. This typically takes the form of defining the dimensions of various input/output tensors, and placeholders for the various layers, embeddings, and transformations we might be interested in using.

In addition to the constructor, we need to define a method called forward. As its name suggests, forward will contain whatever calculations are needed to complete the forward-pass across our model. In addition to these two requirements, we're free to include additional helper methods as needed!

Now let’s build a simple recurrent neural network for NER. The first step will be to learn the word embeddings for this task. We’ll specify the size of our embeddings with embedding_dim, and let vocab_size denote the total number of words in our vocabulary.

We’ll then pass this sequence of embeddings to our LSTM cell. Here, hidden_dim specifies the size of the hidden state vector. The last step is to pass the final LSTM output to a fully-connected layer to generate the scores for each tag. One other point that you might notice in the following code is the .view() method. You can think of .view() as essentially Pytorch’s version of a reshape in NumPy.
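Here’s a sketch of what such a model class might look like. The class name LSTMTagger and the exact layer setup are illustrative assumptions, in the spirit of the standard Pytorch sequence-tagging example, rather than a definitive implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        # Call the parent class constructor first
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        # Learn an embedding for each word in the vocabulary
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs and produces hidden states
        # of size hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # Fully-connected layer mapping hidden states to tag scores
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        # .view() reshapes the embeddings to (seq_len, batch=1, embedding_dim),
        # the shape the LSTM expects -- Pytorch's version of a NumPy reshape
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
```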

Instantiating the model

In the last section we created the model’s class, but we haven’t actually created an instance of the model yet. So let’s do that now! We’ll also make a quick helper function.

We can’t just toss raw sequences of tokens or tags into the model; we first need to convert them to indices that correspond either to particular words in our vocabulary or to particular tags. To do this, we’ll use the helper function prepare_sequence, along with word_to_ix and tag_to_ix, to preprocess each sentence. Notice that the output type of prepare_sequence is a torch tensor object.
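A minimal version of that helper, along with the word_to_ix and tag_to_ix lookups it relies on, might look like this (the construction details are my own sketch, building on the loading snippet above):

```python
import torch

# Assign each word in the training data a unique integer index
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

# Assign each NER tag a categorical class index
tag_to_ix = {tag: i for i, tag in enumerate(unique_tags)}

def prepare_sequence(seq, to_ix):
    """Convert a list of tokens (or tags) to a tensor of integer indices."""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)
```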

For demo purposes, I’ll only train things on the first 100 samples, and use 64 as the size of my word embeddings and hidden state vectors.

With those bits out of the way, let’s create an instance of our model and define a loss function. We’ll use negative log-likelihood as our loss function, and standard stochastic gradient descent as our optimizer.

The last couple of lines in the following snippet illustrate one of the features I think is really nice about Pytorch. They will detect whether we have a GPU available and allow us to easily transfer our model and data over to the GPU for training. One of the pain points I’ve experienced with Tensorflow and Keras is making sure that my models are actually using GPU resources, and I’ve found that Pytorch makes this very simple. In fact, for reasons like this, it is actually more appropriate to think of Pytorch as sort of a GPU-enabled version of NumPy rather than a pure deep learning library.
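Putting that together might look something like the following sketch. The 64-dimensional embeddings/hidden states and the 100-sample slice come from the setup described above, while the learning rate is an arbitrary illustrative value rather than a tuned one:

```python
import torch
import torch.nn as nn
import torch.optim as optim

EMBEDDING_DIM = 64
HIDDEN_DIM = 64

# For demo purposes, only train on the first 100 samples
training_data = training_data[:100]

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)  # assumption: lr chosen for illustration

# Detect whether a GPU is available and move the model onto it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
```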

Training

All right, so we have our data, we have our model, now let’s train it! Training in Pytorch is another area where it differs pretty significantly from Keras or other machine learning libraries that you may have seen in the past. Rather than the usual model.fit(...) and model.predict(...) that you may be used to, we need to explicitly iterate over the number of epochs, and over the training data.

In addition to iterating over epochs/samples, there are a couple other “gotchas” to watch out for. Models in Pytorch have two “modes”: training and evaluation. Training mode allows us to tell Pytorch that we are (surprise surprise…) training the model. This may seem strange at first, but if we’re using things like drop-out, our model may behave slightly differently depending on whether we are training or evaluating new input.

To actually calculate how much we’d like to update our model’s weights, we need information about the gradient. Pytorch automatically computes the gradient of whatever loss criterion we define, but the important thing to note is that the gradient accumulates across training samples. This means that if we just keep passing more and more training samples through our network, the gradient information stored by the network will continually get larger and larger! Combating this is easy — we just need to zero out the model’s gradient information between batches.

Now we’re ready to actually throw things through the model. We’ll do any preprocessing (if necessary) and then simply run it through our model to compute the forward-pass. In the same vein, we can then compute the loss by throwing our predicted output and training labels into whatever loss function we’ve defined.

After computing the loss, we can then do the actual backward pass, and finally update our model’s parameters. If this all seems rather annoying, that’s because it is! In a future post we’ll see some 3rd party libraries that can help alleviate some of this pain, and allow for easier ways to do model checkpointing, learning rate scheduling, and integrating with tensorboard.
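A bare-bones version of that training loop is sketched below; the epoch count is an arbitrary choice for illustration, and the helper names follow the earlier snippets:

```python
NUM_EPOCHS = 10  # assumption: pick however many epochs you need

for epoch in range(NUM_EPOCHS):
    model.train()  # put the model in training mode
    total_loss = 0.0
    for sentence, tags in training_data:
        # Zero out gradient information accumulated from the previous sample
        model.zero_grad()

        # Preprocess the inputs and move them to the same device as the model
        sentence_in = prepare_sequence(sentence, word_to_ix).to(device)
        targets = prepare_sequence(tags, tag_to_ix).to(device)

        # Forward pass
        tag_scores = model(sentence_in)

        # Compute the loss, backpropagate, and update the weights
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f'Epoch {epoch + 1}: loss = {total_loss / len(training_data):.4f}')
```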

Evaluation

All right, now that our model’s trained let’s feed it some new data to see how it performs. Before evaluating a new sample, we need to remember to put our model in evaluation mode! The variable N will allow us to pick some sample from our training data, and we can then quickly inspect preds and correct as a qualitative test of how well our model is doing.

This is a pretty hand-wavy way of checking our model, which is actually okay in this example. Keep in mind that the goal of this post isn’t about assessing model performance, but rather how to build a simple model with Pytorch.
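Here’s a sketch of that check; the ix_to_tag lookup is a hypothetical helper for turning predicted indices back into tag strings:

```python
# Put the model in evaluation mode so layers like drop-out behave correctly
model.eval()

N = 0  # index of the training sample to inspect
sentence, tags = training_data[N]

# Hypothetical reverse lookup from tag index back to tag string
ix_to_tag = {i: tag for tag, i in tag_to_ix.items()}

with torch.no_grad():
    inputs = prepare_sequence(sentence, word_to_ix).to(device)
    tag_scores = model(inputs)
    # Pick the highest-scoring tag for each word
    pred_indices = torch.argmax(tag_scores, dim=1).tolist()

preds = [ix_to_tag[i] for i in pred_indices]
correct = tags

print(f'{"Original":<15}|{"Correct":<15}|Predicted')
for word, gold, pred in zip(sentence, correct, preds):
    print(f'{word:<15}|{gold:<15}|{pred}')
```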

Original       |Correct        |Predicted
However        |O              |O
,              |O              |O
a              |O              |O
video          |O              |O
released       |O              |O
Monday         |B-tim          |B-tim
,              |O              |O
by             |O              |O
Iran           |B-geo          |B-gpe
shows          |O              |O
the            |O              |O
sailors        |O              |O
and            |O              |O
marines        |O              |O
relaxing       |O              |O
and            |O              |O
socializing    |O              |O
during         |O              |O
their          |O              |O
captivity      |O              |O
.              |O              |O

Conclusion

Congrats! If you’ve made it this far, you now know the basics of how to make a simple model with Pytorch. We did it for the task of NER here, but this framework could be extended to tons of other tasks. Do you have any Pytorch tips or tricks you’d like to share? Are there any n00b mistakes in this post you’d like to correct? Let me know!
