Attention Seq2Seq with PyTorch: learning to invert a sequence

TL;DR: In this article you’ll learn how to implement sequence-to-sequence models with and without attention on a simple case: inverting a randomly generated sequence.

You might already have come across thousands of articles explaining sequence-to-sequence models and attention mechanisms, but few are illustrated with code snippets. 
Below is a non-exhaustive list of articles talking about sequence-to-sequence algorithms and attention mechanisms:

Attention and sequence-to-sequence models

These models are used to map input sequences to output sequences. A sequence is a data structure in which there is a temporal dimension, or at least a sense of “order”. Think of translating French sentences (= sequences) to English sentences, or doing speech-to-text (audio -> text), or text-to-speech, etc. 
“Attention” is a variant of sequence-to-sequence models and has enabled major improvements in the fields above.

Source: http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf

Here is a very simple illustration of a sequence-to-sequence model. The goal is to translate from German to English by feeding the words in red to the Encoder (red arrows), and sequentially outputting English words (blue arrows). Notice how important h3 is, as all the future (blue) predictions rely on it…

As a machine learning engineer, I started working with Tensorflow a couple of years ago. Without taking sides in the PyTorch-vs-Tensorflow debate, I reckon Tensorflow has a lot of advantages, among them the richness of its API and the quality of its contributors. The main drawback I found is that it is hard to debug, and as time went by, I got a bit frustrated by how difficult it was to customize my network (defining custom losses, for example).
I believe deep learning isn’t just about copying and pasting bits of code from the web; it’s about being able to implement your own network and understand every pitfall.

I therefore decided to take a step towards PyTorch.

PyTorch is not perfect, but it has the advantage of being more pythonic, and its dynamic computation graph, as opposed to Tensorflow’s static one, makes it easier to debug and to run unit tests (you do use tests in your code, right?). Above all, PyTorch offers a nice API (though not as extensive as Tensorflow’s) and lets you define custom modules. Forcing yourself to rewrite modules makes you understand what you are doing. For instance, I had been using Tensorflow’s AttentionWrapper when designing seq2seq models in the past, but implementing a custom attention module in PyTorch allowed me to fully understand its subtleties.
Enough talking, let’s dive into the code.

If you want to start playing without reading any further, the code is available here: https://github.com/b-etienne/Seq2seq-PyTorch

Problem we want to solve

Our goal is to compare the efficiency of two models on a simple task which consists in learning how to invert a given sequence. The two models are sequence-to-sequence models with an Encoder and a Decoder, one with an attention mechanism and one without.

Why ?

This problem is one of the simplest sequence-to-sequence tasks you can think of. If, however, you decide to develop your own architecture and it fails on this simple task, it’s probably not going to lead anywhere on more complex tasks… So getting our model to work here is a good way to make sure our code is working.

Generating data

We generate random sequences of variable length, picked from an ensemble, or “vocabulary”, of four letters: “a”, “b”, “c”, “d”. The target is the reversed input, e.g. “abcd” -> “dcba”.
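
Just to make the setup concrete, here is a minimal sketch of how such pairs could be generated (function and variable names are mine, not necessarily those of the repository):

```python
import random

VOCAB = ["a", "b", "c", "d"]

def generate_pair(min_len=4, max_len=10):
    """Draw a random variable-length sequence and pair it with its reverse."""
    length = random.randint(min_len, max_len)
    sequence = [random.choice(VOCAB) for _ in range(length)]
    target = sequence[::-1]
    return sequence, target

# e.g. (['b', 'a', 'd', 'c'], ['c', 'd', 'a', 'b'])
pairs = [generate_pair() for _ in range(10000)]
```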

The Encoder

The encoder is the “listening” part of the seq2seq model. It consists of recurrent layers (RNN, GRU, LSTM, pick your favorite), before which you can add convolutional or dense layers. The important part here is the use of the pack_padded_sequence and pad_packed_sequence helpers before feeding your data into the encoder.
As we are using batches of data, each item (sequence) in a batch has a different length. We pad all sequences in the batch with 0s up to the length of the longest sequence (this is a classic process with variable-length batches, and you can find plenty of posts on the subject online). The pack_padded_sequence and pad_packed_sequence helpers let the encoder deal with these uninformative paddings.
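
As an illustration, here is a minimal encoder sketch using these helpers (a single GRU layer; the sizes, names and exact architecture are illustrative and may differ from the repository):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class EncoderRNN(nn.Module):
    """Illustrative encoder: embedding + GRU over 0-padded batches."""
    def __init__(self, vocab_size, embedding_dim=32, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.gru = nn.GRU(embedding_dim, hidden_size, batch_first=True)

    def forward(self, inputs, lengths):
        # inputs: (batch, max_len) token ids padded with 0s
        # lengths: true sequence lengths, so the GRU can skip the padding
        embedded = self.embedding(inputs)
        packed = pack_padded_sequence(embedded, lengths,
                                      batch_first=True, enforce_sorted=False)
        packed_outputs, hidden = self.gru(packed)
        # outputs: (batch, max_len, hidden_size), re-padded with 0s
        outputs, _ = pad_packed_sequence(packed_outputs, batch_first=True)
        return outputs, hidden
```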

The Decoder

The Decoder is the module responsible for outputting predictions, which will then be used to calculate the loss. In a nutshell, the Decoder with attention takes as inputs the outputs of the encoder and decides which part to focus on to output a prediction. Without attention, only the last hidden state of the encoder is used.
In our case, the Decoder operates sequentially on the target sequence and at each time step takes as inputs:
- an input (the next character in the target sequence or the previously emitted character)
- a hidden state
- other arguments, depending on whether you’re using attention or not.

As we will use different decoders, let’s define a base Decoder module:
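
A minimal sketch of such a base module could look like this (layer sizes and method names are illustrative; see the repository for the actual implementation):

```python
import torch.nn as nn

class BaseDecoder(nn.Module):
    """Illustrative base class shared by the decoders with and without attention."""
    def __init__(self, vocab_size, embedding_dim=32, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = nn.GRU(embedding_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward_step(self, input_token, hidden, encoder_outputs=None, lengths=None):
        # One decoding time step; subclasses decide how (and whether) to attend.
        raise NotImplementedError
```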

RNN Decoder

The simplest form of sequence-to-sequence model consists of an RNN Encoder and an RNN Decoder. Here is what our RNN Decoder looks like:
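
A sketch of this decoder, building on the base module above (illustrative rather than the repository's exact code):

```python
class RNNDecoder(BaseDecoder):
    """Plain decoder: no attention, it only sees the encoder's final hidden state."""
    def forward_step(self, input_token, hidden, encoder_outputs=None, lengths=None):
        # input_token: (batch, 1) id of the ground-truth or previously emitted character
        embedded = self.embedding(input_token)
        output, hidden = self.rnn(embedded, hidden)
        logits = self.out(output.squeeze(1))     # (batch, vocab_size)
        return logits, hidden, None              # no attention weights to return
```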

Attention Decoder

Now here’s the part where we use attention. Again, if you want to understand the theory behind it, refer to the links at the beginning of the article. It is still unclear to me whether the recurrent cell should be called before or after computing the attention weights. I tried both methods, and both lead to convergence. I’d appreciate any help on this particular point in the comments below, should you know the answer.
Here is what our AttentionDecoder would look like (I refer to the “Listen, Attend and Spell” paper in my code):
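
A sketch of such a decoder, calling the recurrent cell before attending (it relies on the Attention module defined in the next section; the exact layout is an assumption, not the repository's code):

```python
class AttentionDecoder(BaseDecoder):
    """Decoder that attends over all encoder outputs at every time step."""
    def __init__(self, vocab_size, embedding_dim=32, hidden_size=64):
        super().__init__(vocab_size, embedding_dim, hidden_size)
        self.attention = Attention(hidden_size, method="dot")
        # The context vector is concatenated to the RNN output before projection.
        self.out = nn.Linear(2 * hidden_size, vocab_size)

    def forward_step(self, input_token, hidden, encoder_outputs=None, lengths=None):
        embedded = self.embedding(input_token)
        output, hidden = self.rnn(embedded, hidden)   # recurrent cell first...
        # ...then attend: query = current decoder state, keys = encoder outputs
        context, weights = self.attention(output, encoder_outputs, lengths)
        logits = self.out(torch.cat([output, context], dim=-1).squeeze(1))
        return logits, hidden, weights
```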

Attention Scoring function

At the heart of the AttentionDecoder lies an Attention module. This module allows us to compute different attention scores. The two main variants are Luong's and Bahdanau's: Luong's attention is said to be “multiplicative” while Bahdanau's is “additive”. Details can be found in the corresponding papers.

We therefore define a custom Attention module with options to calculate the similarity between tensors (sometimes referred to as keys and queries). The similarity scores are normalized into probabilities using a softmax function. This module uses a custom mask_3d function for masking paddings in the weights computation. I’ll leave it up to you to implement it :)
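
Here is a sketch of such a module with a multiplicative (Luong-style) score; the masking step is only indicated in a comment since mask_3d is left to you, and the "general" option and tensor shapes are illustrative assumptions:

```python
class Attention(nn.Module):
    """Returns a context vector as a weighted sum of the encoder outputs.

    'dot' is the simplest multiplicative (Luong-style) score; an additive
    (Bahdanau-style) score would use learned projections and a tanh instead.
    """
    def __init__(self, hidden_size, method="dot"):
        super().__init__()
        self.method = method
        if method == "general":   # Luong's "general" multiplicative variant
            self.W = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, query, keys, lengths=None):
        # query: (batch, 1, hidden)        current decoder state
        # keys:  (batch, src_len, hidden)  encoder outputs
        proj_keys = self.W(keys) if self.method == "general" else keys
        scores = torch.bmm(query, proj_keys.transpose(1, 2))  # (batch, 1, src_len)
        # In the full version, the custom mask_3d helper would be applied to the
        # scores here (using lengths) so padded positions get zero probability.
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, keys)                     # (batch, 1, hidden)
        return context, weights
```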

Putting it all inside a Seq2Seq module

Once our Encoder and Decoder are defined, we can create a Seq2Seq model with a PyTorch module encapsulating them. I will not dwell on the decoding procedure, but just so you know, we can choose between teacher forcing and scheduled sampling strategies during decoding. If you’ve never heard of them before, make sure you look them up online. I’ll leave this module as homework for you, but you can contact me if you need any help.
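
The module itself stays homework, but just to make the two strategies concrete, the training-time decoding loop of such a module could look roughly like this (a hypothetical method, assuming the decoders sketched above and a start-of-sequence token at the beginning of each target):

```python
# A hypothetical method of a Seq2Seq(nn.Module) holding self.encoder / self.decoder.
# Assumes `import random, torch` and that each target starts with an SOS token.
def decode(self, targets, encoder_outputs, hidden, lengths,
           teacher_forcing_ratio=0.5):
    _, max_len = targets.size()
    input_token = targets[:, :1]            # start-of-sequence tokens
    all_logits = []
    for t in range(1, max_len):
        logits, hidden, _ = self.decoder.forward_step(
            input_token, hidden, encoder_outputs, lengths)
        all_logits.append(logits)
        # Teacher forcing feeds the ground truth back in; otherwise we feed the
        # model's own prediction (scheduled sampling anneals this ratio over training).
        if random.random() < teacher_forcing_ratio:
            input_token = targets[:, t:t + 1]
        else:
            input_token = logits.argmax(dim=-1, keepdim=True)
    return torch.stack(all_logits, dim=1)   # (batch, max_len - 1, vocab_size)
```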

Results

We use the cross-entropy loss during training. During evaluation, we compute accuracy from the Levenshtein distance between the predicted sequence and the target sequence. The seq2seq model without attention reaches a plateau, while the seq2seq model with attention learns the task much more easily:
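
One plausible way to turn the Levenshtein distance into an accuracy between 0 and 1 (the repository may define the metric differently):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def sequence_accuracy(prediction, target):
    # 1.0 for a perfect match, lower as more edits are needed.
    return 1.0 - levenshtein(prediction, target) / max(len(target), 1)

print(sequence_accuracy("dcba", "dcba"))  # 1.0
print(sequence_accuracy("dcab", "dcba"))  # 0.5
```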

Let’s visualize the attention weights during inference for the attention model to see whether the model indeed learns. As we can see, the diagonal goes from the top left-hand corner to the bottom right-hand corner. This shows that the network learns to focus first on the last character and last on the first character:


Oh and by the way, if you’re interested in sequence-to-sequence models, I wrote an article on Neural Turing Machines here :)