Understanding emotions — from Keras to pyTorch

Introducing torchMoji, a PyTorch implementation of DeepMoji

Detecting emotions, sentiments & sarcasm is a critical element of our natural language understanding pipeline at HuggingFace 🤗. We recently switched to an integrated system based on an NLP model from the MIT Media Lab.

Update: We’ve open sourced it! Repo on GitHub

The model was initially designed in TensorFlow/Theano/Keras, and we ported it to pyTorch. Compared to Keras, pyTorch gives us more freedom to develop and test custom neural network modules and uses easy-to-read, numpy-style code. In this post, I will detail several interesting points that arose during the reimplementation:

  • how to make a custom pyTorch LSTM with custom activation functions,
  • how the PackedSequence object works and is built,
  • how to convert an attention layer from Keras to pyTorch,
  • how to load your data in pyTorch: DataSets and smart Batching,
  • how to reproduce Keras weight initialization in pyTorch.

First, let’s look at the torchMoji/DeepMoji model. It is a fairly standard and robust NLP neural net with two bi-LSTM layers followed by an attention layer and a classifier:

torchMoji/DeepMoji model

How to build a custom pyTorch LSTM module

A very nice feature of DeepMoji is that Bjarke Felbo and co-workers were able to train the model on a massive dataset of 1.6 billion tweets. The pre-trained model thus carries a very rich representation of the emotions and sentiments in the training set and we would like to use the pre-trained weights.

However, the model was trained with Theano/Keras’ default activation for the recurrent kernel of the LSTM, a hard sigmoid. pyTorch, on the other hand, is tightly modeled around NVIDIA’s cuDNN library for efficient GPU acceleration, and cuDNN natively supports only LSTMs with standard sigmoid recurrent activations:

Keras default LSTM VS pyTorch default LSTM
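To make the difference concrete, here is a small sketch comparing the two activations. It assumes the piecewise-linear definition used by Keras' `hard_sigmoid` (clip `0.2 * x + 0.5` to the range `[0, 1]`); the function names are only illustrative.

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid, the gate activation of the default pyTorch LSTM."""
    return 1.0 / (1.0 + math.exp(-x))

def hard_sigmoid(x):
    """Piecewise-linear approximation: 0 below -2.5, 1 above 2.5, linear in between."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  hard_sigmoid={hard_sigmoid(x):.3f}")
```

The two curves coincide at 0 but differ at most other inputs, which is why weights trained with one activation cannot simply be reused with the other.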

I thus wrote a custom LSTM layer with hard sigmoid recurrent activation functions:
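As a rough idea of what such a cell can look like, here is a minimal sketch of a single LSTM step with hard sigmoid gates. It is not necessarily the code from the repo: the function names are my own, the hard sigmoid is assumed to be `clip(0.2 * x + 0.5, 0, 1)`, and the gates are assumed to be stacked in pyTorch's usual order (input, forget, cell, output).

```python
import torch
import torch.nn.functional as F

def hard_sigmoid(x):
    # Assumed Keras-style hard sigmoid: clip(0.2 * x + 0.5, 0, 1)
    return torch.clamp(0.2 * x + 0.5, min=0.0, max=1.0)

def lstm_cell_hard_sigmoid(x, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
    """One LSTM time step where the three gates use a hard sigmoid."""
    hx, cx = hidden
    # All four gate pre-activations are computed in one pair of matrix products
    gates = F.linear(x, w_ih, b_ih) + F.linear(hx, w_hh, b_hh)
    ingate, forgetgate, cellgate, outgate = gates.chunk(4, dim=1)
    ingate = hard_sigmoid(ingate)
    forgetgate = hard_sigmoid(forgetgate)
    cellgate = torch.tanh(cellgate)
    outgate = hard_sigmoid(outgate)
    cy = forgetgate * cx + ingate * cellgate   # new cell state
    hy = outgate * torch.tanh(cy)              # new hidden state
    return hy, cy

# Toy usage: batch of 3, input size 4, hidden size 5
torch.manual_seed(0)
batch, input_size, hidden_size = 3, 4, 5
x = torch.randn(batch, input_size)
h0 = torch.zeros(batch, hidden_size)
c0 = torch.zeros(batch, hidden_size)
w_ih = torch.randn(4 * hidden_size, input_size)
w_hh = torch.randn(4 * hidden_size, hidden_size)
hy, cy = lstm_cell_hard_sigmoid(x, (h0, c0), w_ih, w_hh)
print(hy.shape, cy.shape)
```

The only change from a textbook LSTM step is the gate activation; everything else (state update, output) is unchanged.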

This LSTM cell has to be integrated in a full module that can make use of all the pyTorch facilities (variable number of layers and directions, inputs as PackedSequences). This integration is quite long so I’ll refer you directly to the relevant file of the repo.

Writing a custom LSTM cell also means that we lose some of the easy and fast GPU capabilities of cuDNN. As we mainly want to use the pre-trained model in production on a CPU and maybe fine-tune a small classifier on top of it, this is not a problem for us, but it means that the model should be further adapted to make use of the GPU more efficiently if you would like to re-train it from scratch.

Attention layer: side-by-side Keras & pyTorch

The attention layer of our model is an interesting module where we can do a direct one-to-one comparison between the Keras and the pyTorch code:

pyTorch attention module
Keras attention layer

As you can see, the general algorithm is roughly identical but most of the lines in the pyTorch implementation are comments while a Keras implementation requires you to write several additional functions and reshaping calls.

When it comes to writing and debugging custom modules and layers, pyTorch is a faster option while Keras is clearly the fastest track when you need to quickly train and test a model built from standard layers.
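To give a flavor of the pyTorch side, here is a minimal sketch of a masked attention layer that computes a weighted average of the LSTM outputs. It is an illustration under my own assumptions rather than the exact module from the repo: the initialization of the attention vector (`std=0.05`) is arbitrary, and the inputs are assumed to be an already padded tensor plus a tensor of sequence lengths.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Weighted average of RNN outputs, scored by a single learned vector."""

    def __init__(self, attention_size):
        super().__init__()
        self.attention_vector = nn.Parameter(torch.empty(attention_size))
        nn.init.normal_(self.attention_vector, std=0.05)  # arbitrary choice

    def forward(self, inputs, input_lengths):
        # inputs: (batch, max_len, attention_size); input_lengths: (batch,)
        logits = inputs.matmul(self.attention_vector)      # (batch, max_len)
        # Subtract the max for numerical stability before exponentiating
        unnorm = (logits - logits.max()).exp()
        # Zero out the weights of padded positions
        max_len = inputs.size(1)
        positions = torch.arange(max_len, device=inputs.device).unsqueeze(0)
        mask = (positions < input_lengths.unsqueeze(1)).to(inputs.dtype)
        masked = unnorm * mask
        # Normalize so the weights of each sequence sum to 1
        attentions = masked / masked.sum(dim=1, keepdim=True)
        # Weighted sum of the inputs over the time dimension
        representations = (inputs * attentions.unsqueeze(-1)).sum(dim=1)
        return representations, attentions

torch.manual_seed(0)
attn = Attention(attention_size=8)
inputs = torch.randn(2, 4, 8)      # 2 sequences, padded to 4 time steps
lengths = torch.tensor([4, 2])     # the second sequence has 2 padded steps
rep, weights = attn(inputs, lengths)
print(rep.shape, weights.shape)
```

The mask plays the role of Keras' masking: padded time steps get a weight of exactly zero and do not contribute to the sentence representation.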

How the PackedSequence object works

Keras has a nice masking feature to deal with variable-length sequences. How do we do that in pyTorch? We use PackedSequences! PackedSequence is not covered in much detail in the pyTorch docs, so I will spend some time describing it in greater detail.

A typical NLP batch with five sequences and a total of 18 tokens

Let’s say we have a batch of sequences with variable lengths (as is often the case in NLP applications). To parallelize the computation of such a batch on the GPU we would like:

  • to process the sequences in parallel as much as possible, given that the LSTM hidden state needs to depend on the previous time step of each sequence, and
  • to stop the computation of each sequence at the right time step (the end of each sequence).

This can be done by using the PackedSequence pyTorch class as follows. We first sort the sequences by decreasing length and gather them in a (padded) tensor. Then we call the pack_padded_sequence function on the padded tensor and the list of sequence lengths:

Packing a batch in a PackedSequence object
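In code, the packing step could look like the sketch below. The token values are made up for illustration (they are not taken from the figure); only the lengths (6, 5, 4, 2, 1) are chosen so that the batch has five sequences and 18 tokens in total.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Five sequences of integer tokens, already sorted by decreasing length
sequences = [
    torch.tensor([1, 2, 3, 4, 5, 6]),
    torch.tensor([7, 8, 9, 10, 11]),
    torch.tensor([12, 13, 14, 15]),
    torch.tensor([16, 17]),
    torch.tensor([18]),
]
lengths = [len(s) for s in sequences]  # [6, 5, 4, 2, 1]

# Pad with zeros to (batch, max_len), then add a trailing feature dimension
padded = pad_sequence(sequences, batch_first=True).unsqueeze(-1)  # (5, 6, 1)

packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data.shape)            # torch.Size([18, 1])
print(packed.batch_sizes.tolist())  # [5, 4, 3, 3, 2, 1]
```

Note that the padding values never reach `packed.data`: only the 18 real tokens are kept.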

The PackedSequence object comprises:

  • a `data` object: a Variable of shape (total # of tokens, dims of each token), in our simple case with five sequences of tokens (represented by integers): (18, 1)
  • a `batch_sizes` object: a list of the number of tokens per time step, in our case: [5, 4, 3, 3, 2, 1]

How the pack_padded_sequence function constructs this object is simple:

How to construct a PackedSequence object (with batch_first=True)
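The construction can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not pyTorch's actual implementation; `pack_sorted_batch` is a hypothetical helper and the token values are made up.

```python
def pack_sorted_batch(padded, lengths):
    """Return (data, batch_sizes) for a padded batch sorted by decreasing length.

    padded  : list of equal-length rows (batch_first layout)
    lengths : true length of each row, in decreasing order
    """
    data, batch_sizes = [], []
    for t in range(max(lengths)):
        # Rows still active at time step t; they form a prefix because rows are sorted
        n_active = sum(1 for length in lengths if length > t)
        data.extend(padded[i][t] for i in range(n_active))
        batch_sizes.append(n_active)
    return data, batch_sizes

PAD = 0
padded = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, PAD],
    [12, 13, 14, 15, PAD, PAD],
    [16, 17, PAD, PAD, PAD, PAD],
    [18, PAD, PAD, PAD, PAD, PAD],
]
lengths = [6, 5, 4, 2, 1]
data, batch_sizes = pack_sorted_batch(padded, lengths)
print(batch_sizes)  # [5, 4, 3, 3, 2, 1]
print(data[:5])     # [1, 7, 12, 16, 18]
```

In other words, the padded batch is read column by column (time step by time step), and each column is cut off at the number of sequences that are still running.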

One nice property of the PackedSequence object is that we can perform many operations directly on the PackedSequence data variable without having to unpack the sequence (which is a slow sequential operation). In particular, we can perform any operation that is local to each token (i.e. insensitive to the token order/context). Of course, we can also apply any pyTorch Module that accepts PackedSequence inputs.

In our NLP model, we can, for example, concatenate the outputs of the two LSTM modules without unpacking the PackedSequence object and apply an LSTM to this object. We could also perform some operations of our attention layer without unpacking (like vector product, exponentiation).
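Here is a minimal sketch of this idea with toy sizes of my own choosing. The pyTorch docs advise against building PackedSequence objects by hand in general, so treat the direct constructor call as an illustration that relies on the two outputs sharing the same `batch_sizes`.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import PackedSequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
# Toy batch: 3 sequences of 4-dim token vectors with lengths 3, 2, 1 (sorted)
padded = torch.randn(3, 3, 4)
lengths = [3, 2, 1]
packed = pack_padded_sequence(padded, lengths, batch_first=True)

lstm_0 = nn.LSTM(input_size=4, hidden_size=5, batch_first=True)
lstm_1 = nn.LSTM(input_size=5, hidden_size=5, batch_first=True)
out_0, _ = lstm_0(packed)   # PackedSequence in, PackedSequence out
out_1, _ = lstm_1(out_0)

# Token-wise operation: concatenate features directly on the flat `data` tensors
concat = PackedSequence(
    torch.cat([out_0.data, out_1.data], dim=1),
    out_0.batch_sizes,
    out_0.sorted_indices,
    out_0.unsorted_indices,
)
print(concat.data.shape)  # torch.Size([6, 10])

# Unpack only when a padded tensor is really needed
unpacked, unpacked_lengths = pad_packed_sequence(concat, batch_first=True)
print(unpacked.shape)     # torch.Size([3, 3, 10])
```

The concatenation touches each of the 6 real tokens exactly once and never sees a padded position.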

Another thing to be careful about is the ordering of the labels. Since you have now sorted the input sentences by length, you should sort the labels as well, using the permutation indices you got when you sorted the input:

labels = labels[perm_index]
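For context, here is a small sketch of where `perm_index` can come from. The batch below is made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Unsorted toy batch: padded token ids, true lengths, and one label per sequence
padded = torch.tensor([[4, 5, 0, 0],
                       [1, 2, 3, 9],
                       [6, 0, 0, 0]])
lengths = torch.tensor([2, 4, 1])
labels = torch.tensor([0, 1, 2])

# Sort by decreasing length and keep the permutation
lengths, perm_index = lengths.sort(0, descending=True)
padded = padded[perm_index]
labels = labels[perm_index]   # keep the labels aligned with their sentences

packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(perm_index.tolist())  # [1, 0, 2]
print(labels.tolist())      # [1, 0, 2]
```

Recent pyTorch releases also accept `enforce_sorted=False` in pack_padded_sequence, which handles the sorting internally; the explicit version above makes the label reordering visible.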

Smart data loading in pyTorch: DataSets & Batches

In Keras, data loading and batching are often hidden in the fit_generator function. Again, this is nice when you want to quickly test a model, but it also means we don’t fully control what is happening in this rather critical part of the model.

In pyTorch, we will combine three nice classes to do this task:

  • a DataSet to hold, pre-process and index the dataset,
  • a BatchSampler to control how the samples are gathered in batches, and
  • a DataLoader that will take care of feeding these batches to our model.

Our DataSet class is very simple:
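A minimal sketch of such a class is shown below. The name `SentenceDataset` and the assumption that sentences are already tokenized and padded to a common length are mine, not taken from the repo.

```python
import torch
from torch.utils.data import Dataset

class SentenceDataset(Dataset):
    """Holds tokenized sentences (as integer ids) and their labels."""

    def __init__(self, X, y):
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of samples")
        self.X = torch.as_tensor(X, dtype=torch.long)
        self.y = torch.as_tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        return self.X[index], self.y[index]

dataset = SentenceDataset([[1, 2, 0], [3, 0, 0], [4, 5, 6]], [0, 1, 0])
print(len(dataset))  # 3
```

Implementing `__len__` and `__getitem__` is all a DataLoader needs from a map-style dataset.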

Our BatchSampler is more interesting.

We have several small NLP datasets that we would like to use to fine-tune our model on emotion, sentiment and sarcasm detection. These datasets have varying lengths and sometimes unbalanced classes so we would like to design a batch sampler that could:

  • gather batches in epochs of a pre-defined number of samples so our training process can be independent of the batch lengths, and
  • be able to sample in a balanced way from the unbalanced datasets.

In pyTorch, a BatchSampler is a class on which you can iterate to yield batches. Each batch yielded by the BatchSampler is a list with the indices of the samples to pick in the DataSet.

We can thus define a BatchSampler that will be initialized using a dataset class label vector to construct a list of batches fulfilling our needs:
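The sketch below illustrates the idea with a deliberately simple strategy: pick a class uniformly at random, then a sample within that class, with replacement. This is a stand-in technique chosen for brevity, not necessarily how the sampler in the repo balances classes, and `BalancedBatchSampler` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict

class BalancedBatchSampler:
    """Yields lists of dataset indices, drawing every class with equal probability.

    An "epoch" is a fixed number of samples (epoch_size), independent of the
    dataset length. Sampling is done with replacement.
    """

    def __init__(self, labels, batch_size, epoch_size, seed=None):
        self.batch_size = batch_size
        self.n_batches = epoch_size // batch_size
        self.rng = random.Random(seed)
        self.indices_by_class = defaultdict(list)
        for index, label in enumerate(labels):
            self.indices_by_class[label].append(index)
        self.classes = sorted(self.indices_by_class)

    def __len__(self):
        return self.n_batches

    def __iter__(self):
        for _ in range(self.n_batches):
            batch = []
            for _ in range(self.batch_size):
                # Pick a class uniformly, then a sample uniformly within that class
                label = self.rng.choice(self.classes)
                batch.append(self.rng.choice(self.indices_by_class[label]))
            yield batch

labels = [0] * 90 + [1] * 10          # unbalanced toy labels
sampler = BalancedBatchSampler(labels, batch_size=8, epoch_size=800, seed=0)
batches = list(sampler)
print(len(batches), len(batches[0]))  # 100 8
```

An object like this can be handed to a DataLoader through its `batch_sampler` argument, e.g. `DataLoader(dataset, batch_sampler=sampler)`.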

From Keras to pyTorch: don’t forget the initialization

One last thing you have to be careful about when porting Keras/Tensorflow/Theano code to pyTorch is the initialization of the weights.

Another powerful feature of Keras in terms of speed of development is that the layers come with default initializations that make a lot of sense.

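pyTorch, by contrast, leaves you free to initialize the weights as you please: its built-in layers use their own defaults, which differ from Keras’, and parameters you create yourself are not initialized for you. To get consistent results when fine-tuning the weights, we thus copy the default Keras initialization of the weights as follows: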
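A minimal sketch for an LSTM is shown below. It assumes the Keras defaults are Glorot (Xavier) uniform for the input kernel, orthogonal for the recurrent kernel and zeros for the biases. Keras' LSTM also has a `unit_forget_bias` option that sets the forget-gate bias to one, which this sketch ignores. The helper name `keras_like_init_` is hypothetical.

```python
import torch
import torch.nn as nn

def keras_like_init_(module):
    """Re-initialize LSTM parameters in place, mimicking the assumed Keras defaults."""
    for name, param in module.named_parameters():
        if "weight_ih" in name:
            nn.init.xavier_uniform_(param)   # input kernel: Glorot uniform
        elif "weight_hh" in name:
            nn.init.orthogonal_(param)       # recurrent kernel: orthogonal
        elif "bias" in name:
            nn.init.zeros_(param)            # biases: zeros

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=6, bidirectional=True)
keras_like_init_(lstm)
print(sorted(name for name, _ in lstm.named_parameters()))
```

Matching on parameter names keeps the helper independent of the number of layers and directions.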

Conclusion

Keras and pyTorch have differing philosophies and goals that we can feel when we compare the two frameworks directly on a single model.

In my opinion and experience:

  • Keras is great for quickly testing various ways to combine standard neural network blocks on a given task,
  • pyTorch is great for quickly developing and testing a custom neural network module with a lot of freedom and easy-to-read, numpy-style code.

I took care to add a lot of comments in my pyTorch code and the original Keras implementation of DeepMoji is also well commented so don’t hesitate to walk through them, use them, and modify them.

Also, clap if you want us to share more of these! 🤗🤗🤗