Paper Summary: Neural Machine Translation by Jointly Learning to Align and Translate

Mike Plotz
Nov 24, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/15, with better formatting.

Neural Machine Translation by Jointly Learning to Align and Translate (2014) Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

Last time we looked at machine translation (MT) as a way to pretrain language models for other tasks. Today we'll look at MT for its own sake. Actually, that's not quite true; it's more like we're looking at neural MT architectures for their own sake.

In any case, the background is that models that use RNN (specifically LSTM) encoders and decoders have done pretty well on this task, but are limited in their ability to track long-term dependencies and even, tellingly, lose their ability to translate the end of long sentences correctly. The cause of this limitation is that this “basic encoder-decoder” architecture encodes everything about the input sentence in a single fixed-length vector (the encoder RNN’s final hidden state). This is not ideal, since we expect intermediate hidden states to contain useful information, and we’d like that information not to have to travel so far to get to where it will be useful.

This paper addresses the single-vector bottleneck problem in two ways: first by using a bidirectional RNN over the input (not a new idea; the citation is Schuster and Paliwal 1997), and second by introducing an alignment model, a small network that scores how well each input location matches each output location, yielding a matrix of weights. This can be thought of as an attention mechanism that allows the decoder to pull information from the useful parts of the input rather than having to decode everything from a single hidden state.

The model, in a bit more detail: the bidirectional RNN (actually two RNNs, one running forward over the input sequence and the other running backward; the paper uses GRU-style gated units rather than LSTMs) outputs two hidden states for each input location j. The forward hidden state contains information from earlier in the sequence and the backward one from later in the sequence; still, these states are primarily about the word at j, because RNNs forget easily. The two states are concatenated, forming what the authors call an annotation. To sum up, the annotation h_j is a representation of the word at j, with some context.
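
To make that concrete, here's a minimal NumPy sketch of how the annotations could be computed. A plain tanh RNN stands in for the paper's gated units, and the names and sizes (rnn_pass, d_h, and so on) are mine, not the paper's.

```python
import numpy as np

def rnn_pass(x, W_x, W_h, h0):
    # Run a simple tanh RNN over a sequence and return all hidden states.
    # x: (T, d_in), W_x: (d_h, d_in), W_h: (d_h, d_h), h0: (d_h,)
    h, states = h0, []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)  # (T, d_h)

rng = np.random.default_rng(0)
T, d_in, d_h = 6, 8, 4                                  # toy sizes
x = rng.standard_normal((T, d_in))                      # embedded source words
Wf = [rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))]
Wb = [rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))]

h_fwd = rnn_pass(x, *Wf, np.zeros(d_h))                 # reads the sentence left to right
h_bwd = rnn_pass(x[::-1], *Wb, np.zeros(d_h))[::-1]     # right to left, then re-aligned
annotations = np.concatenate([h_fwd, h_bwd], axis=1)    # (T, 2*d_h): one annotation h_j per word
```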

Now on to the decoder, which consists of the alignment model and another RNN. For each decoder step i, we feed the previous decoder hidden state and each of the input annotations h_j into a feedforward alignment model a, take the softmax over the j dimension, and use the result to combine the annotations in a weighted sum, producing a context vector c_i. This c_i is an input to the decoder RNN (along with the usual previous decoder hidden state and previous output word), which in turn produces the current hidden state and current output word.

Symbolically, we start by computing the alignment:

e_ij = a(s_{i-1}, h_j)

where s_{i-1} is the previous decoder hidden state, h_j is the annotation, a is the feedforward alignment model, and e_ij is interpreted as an energy (roughly, the importance of annotation j for generating the current output). Then we have the softmax and weighted sum, resulting in the context vector c_i:

α_ij = exp(e_ij) / Σ_k exp(e_ik)
c_i = Σ_j α_ij h_j

Then we have the hidden state and a probability distribution over the output:

s_i = f(s_{i-1}, y_{i-1}, c_i)
p(y_i | y_1, …, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)

where f and g are the decoder RNN and a maxout network (followed by a softmax over the target vocabulary), respectively.
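
Here's a minimal sketch of one attention step in NumPy, using the usual additive parametrization; W_a, U_a, and v_a are my names, and the single tanh layer matches the paper's general setup but not necessarily its exact shapes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(s_prev, annotations, W_a, U_a, v_a):
    # e_ij = a(s_{i-1}, h_j): score each annotation against the previous decoder state
    energies = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
    alphas = softmax(energies)        # softmax over the input positions j
    context = alphas @ annotations    # c_i = Σ_j α_ij h_j
    return context, alphas

rng = np.random.default_rng(1)
T, d_ann, d_s, d_att = 6, 8, 4, 5                 # toy sizes
annotations = rng.standard_normal((T, d_ann))     # stand-in for the encoder's annotations h_j
s_prev = np.zeros(d_s)                            # previous decoder hidden state s_{i-1}
W_a = rng.standard_normal((d_att, d_s))
U_a = rng.standard_normal((d_att, d_ann))
v_a = rng.standard_normal(d_att)

c_i, alphas = attention_step(s_prev, annotations, W_a, U_a, v_a)
# c_i, s_prev, and the previous output word would then feed the decoder cell f,
# whose new state s_i goes into the output network g to score target words
```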

(The paper describes these steps in reverse order, which I found a bit harder to understand, but YMMV. The confusing thing about this order, I suppose, is that the previous decoder hidden state finds its way in at the very beginning.)

Training and results. The authors trained models on sentences of different lengths (up to 30 vs. 50 words), with and without the attention mechanism. Each model was trained for roughly 5 days. Output sentences were generated with beam search (the idea being that greedily choosing the most likely next word produces poor-quality sentences, like accepting your phone's predictive typing for every word, so the decoder instead keeps several candidate sequences in play and picks the best overall).
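
Here's a toy beam-search sketch, just to show the idea of keeping several candidate prefixes instead of greedily taking the best next word. This is not the paper's exact decoding procedure, and step_logprobs, the bigram table, and the scores are all made up.

```python
def beam_search(step_logprobs, beam_size=3, max_len=5, eos="</s>"):
    # step_logprobs(prefix) -> {next_word: log-probability}
    beams = [([], 0.0)]                              # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))   # finished hypothesis, carry it forward
                continue
            for word, lp in step_logprobs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams

# toy "model": a fixed table of next-word log-probs, purely for illustration
def toy_step(prefix):
    table = {
        (): {"the": -0.2, "a": -1.8},
        ("the",): {"cat": -0.5, "dog": -1.0},
        ("a",): {"cat": -0.9, "dog": -0.7},
    }
    return table.get(tuple(prefix), {"sat": -0.3, "</s>": -1.2})

print(beam_search(toy_step))
```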

One interesting comment: “We do not use any monolingual data other than the mentioned parallel corpora, although it may be possible to use a much larger monolingual corpus to pretrain an encoder.” This prefigures some of the recent impressive transfer learning results in NLP that I’ll be covering in the coming few days.

As hoped, the quality of long-sentence translations improved: the baseline RNNencdec-50 model (the plain encoder-decoder, trained on sentences of up to 50 words) produced poor translations of long sentences, even compared with RNNsearch-30 (the model with attention, trained on the shorter sentences). The authors also visualized the weights generated by the alignment model at every input and output word location (see screenshots below), which accord with intuition:

In the first example, “European Economic Area” maps to the order-reversed “zone économique européenne”. In the second example we see the benefit of soft attention, where “the man” maps to “l’ homme”. Since in French the definite article can take several forms (l’/la/le/les), that row splits its attention across “the” and “man” to derive the proper form.
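
For a sense of how such alignment plots are drawn, here's a small matplotlib sketch. The words come from the paper's first example, but the weights are invented purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

source = ["European", "Economic", "Area"]
target = ["zone", "économique", "européenne"]
# made-up attention weights α_ij (rows: target words i, cols: source words j)
alphas = np.array([[0.05, 0.10, 0.85],
                   [0.10, 0.80, 0.10],
                   [0.85, 0.10, 0.05]])

fig, ax = plt.subplots()
ax.imshow(alphas, cmap="gray_r")          # darker cells = higher weight in this colormap
ax.set_xticks(range(len(source)))
ax.set_xticklabels(source, rotation=45)
ax.set_yticks(range(len(target)))
ax.set_yticklabels(target)
ax.set_xlabel("source")
ax.set_ylabel("target")
plt.tight_layout()
plt.show()
```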

Another note: the appendices are very readable and all the layers are fully described. So if you’re curious, don’t hold back.

Schuster and Paliwal 1997 “Bidirectional recurrent neural networks” https://ieeexplore.ieee.org/document/650093/
