Notes on Neural Machine Translation by Jointly Learning to Align and Translate

Published in Paper Club · 7 min read · Aug 23, 2017

Overall impression: this paper proposes a healthy new architecture for neural machine translation, and while it has some holes, the ideas presented are important and encouraging for this young field.

⁉️ Big Question

How can we teach machines how to translate between human languages at higher-than-human accuracy?

🏙 Background Summary

Baseline state-of-the-art machine translation results are currently sourced from phrase-based systems which tune sub-components to produce sentence translations.

Recently, neural machine translation approaches have been attempted. The most promising architecture is the encoder-decoder model, which includes an encoder RNN that reads the input sentence to produce a context c, and a decoder RNN that is trained to predict probabilities of the next word from c.
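For reference, the encoder-decoder setup described above can be written out as follows. The notation follows the paper as I read it: x is the source sentence, h_t the encoder hidden state, c the single context vector, s_t the decoder hidden state, and f, q and g are nonlinear functions.

```latex
h_t = f(x_t, h_{t-1}), \qquad c = q(\{h_1, \dots, h_{T_x}\})

p(\mathbf{y}) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c), \qquad
p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)
```

The point to notice is that the same c appears in every factor of the product, no matter how long the source sentence is.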

I wish they had explained the “breakthrough” of neural machine translation beyond just “neural nets are cool and we are applying them to everything”. Reminds me of how fastText used a simpler architecture and outperformed neural net models.

The biggest pitfall of encoder-decoder models is that they compress sentences of any length into a vector of the same fixed length, making it difficult to capture the information in longer sentences. The authors propose an approach that encodes each sentence into a sequence of vectors and adaptively chooses among them while decoding each word, rather than relying on one large vector. They call this “joint alignment and translation”.

Naively: what happens if you just increase the size of the context vector?

❓ Specific question(s)

Can an encoder-decoder approach that encodes each sentence into more than one subvector beat the state of the art in machine translation?

💭 Approach

The authors propose a new model architecture for neural machine translation. It is based on the encoder-decoder model.

Unlike previous approaches, the decoder produces probabilities based on distinct context vectors for each target word rather than one for the entire sentence.

The context is a weighted sum of annotations, which are an encoded version of the input sequence. The weight of each annotation is computed via an alignment model which scores how well the inputs around a source position match the output at a target position. This score is referred to as the “energy” of an annotation.

It was confusing to introduce the annotations before the encoder that seems responsible for producing them

The energy of an annotation is a proxy for its importance, given the decoder’s previous hidden state, in deciding the next state and the prediction.

This implements an attention mechanism in the decoder, which decides which annotations are most important and carry the highest signal.
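Written out, the pieces described above fit together like this. The notation follows the paper as I read it: h_j is the annotation for source position j, s_{i-1} is the decoder's previous hidden state, and a is the alignment model.

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j

p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i), \qquad
s_i = f(s_{i-1}, y_{i-1}, c_i)
```

The softmax turns the energies for target word i into weights that sum to 1, and the context c_i is recomputed for every target word.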

I don’t have much background in attention, so I’m unsatisfied with the explanation “Intuitively, this implements a mechanism of attention in the decoder.” What’s a good way to deeply understand this? I am vaguely aware of the “attention matrix” and a softmax function fitting in there somewhere
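One way to make the softmax intuition concrete is to compute a single decoder step by hand. The sketch below is only an illustration under assumptions: the scoring function v_a · tanh(W_a s + U_a h) is the additive form given in the paper's appendix as I read it, while the sizes, the random weights, and every function name here are made up for the example.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def attention_context(s_prev, annotations, W_a, U_a, v_a):
    """Return (weights, context) for one decoder step."""
    Ws = matvec(W_a, s_prev)  # same for every source position
    energies = []
    for h_j in annotations:
        Uh = matvec(U_a, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        energies.append(dot(v_a, hidden))  # one scalar energy per source word
    weights = softmax(energies)
    dim = len(annotations[0])
    # Context = weighted sum of the annotations.
    context = [
        sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)
    ]
    return weights, context


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T_x, n, n_a = 5, 4, 3  # toy sizes (assumptions), far smaller than the paper's
annotations = rand_matrix(T_x, 2 * n, rng)  # stand-ins for encoder outputs
s_prev = [rng.gauss(0, 1) for _ in range(n)]
W_a = rand_matrix(n_a, n, rng)
U_a = rand_matrix(n_a, 2 * n, rng)
v_a = [rng.gauss(0, 1) for _ in range(n_a)]

weights, context = attention_context(s_prev, annotations, W_a, U_a, v_a)
print([round(w, 3) for w in weights])
```

Stacking the weight vectors for every target word gives the matrix that attention heatmaps visualize: one row per target word, one column per source word.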

The encoder is a bidirectional RNN which calculates forwards and backwards hidden states for an input sequence. The annotation for a word is computed by concatenating the forward and backwards hidden states.
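A minimal sketch of how such annotations could be built is below. It uses a plain tanh RNN to keep things short, which is a simplification on my part (the paper uses gated hidden units), and the sizes, weights, and function names are all hypothetical.

```python
import math
import random


def rnn_states(inputs, W, U):
    """Run a plain tanh RNN over `inputs`; return the hidden state after each step."""
    hidden_size = len(U)
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(W[i], x))
                + sum(u * hi for u, hi in zip(U[i], h))
            )
            for i in range(hidden_size)
        ]
        states.append(h)
    return states


def bidirectional_annotations(inputs, fwd, bwd):
    """Concatenate forward and backward hidden states position by position."""
    W_f, U_f = fwd
    W_b, U_b = bwd
    forward = rnn_states(inputs, W_f, U_f)
    # Read the sentence right-to-left, then flip back so positions line up.
    backward = rnn_states(inputs[::-1], W_b, U_b)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
embed, hidden, T_x = 3, 2, 4  # toy sizes (assumptions)
sentence = rand_matrix(T_x, embed, rng)  # stand-in word embeddings
fwd = (rand_matrix(hidden, embed, rng), rand_matrix(hidden, hidden, rng))
bwd = (rand_matrix(hidden, embed, rng), rand_matrix(hidden, hidden, rng))

h = bidirectional_annotations(sentence, fwd, bwd)
print(len(h), len(h[0]))  # 4 4: one annotation per word, each of size 2 * hidden
```

Because the backward half of each annotation has already seen the rest of the sentence, even the annotation for the first word carries information about the words that follow it.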

This is a lot to take in, and will likely require multiple passes to really understand

⚗️ Methods

The authors built a model with this architecture for English->French translation, specifically on the ACL WMT ’14 dataset. This set contains 348M words. They use a list of the 30,000 most frequently occurring words in each language for training.
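The vocabulary shortlist can be sketched in a few lines. This is an illustration, not the authors' preprocessing: the paper maps every word outside the shortlist to a special [UNK] token, but the whitespace tokenization, the tiny corpus, and the function names below are assumptions.

```python
from collections import Counter


def build_shortlist(sentences, size):
    """Keep the `size` most frequent words in the corpus."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word for word, _ in counts.most_common(size)}


def replace_unknown(sentence, shortlist, unk="[UNK]"):
    """Map every word outside the shortlist to the unknown token."""
    return " ".join(w if w in shortlist else unk for w in sentence.split())


corpus = ["the cat sat", "the dog sat", "the cat ran"]
shortlist = build_shortlist(corpus, size=3)  # the paper uses 30,000 per language
print(replace_unknown("the dog ran", shortlist))  # the [UNK] [UNK]
```

This is why results are reported separately for sentences with and without unknown words: once a word becomes [UNK], the model has nothing to translate it from.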

In addition to the proposed model, they trained an RNN Encoder-Decoder for comparison. It had 1000 hidden units each in the encoder and decoder.

Each model was trained twice: once on sentences up to 30 words long, and once on sentences up to 50 words long. This is where names like RNNsearch-30 and RNNencdec-50 come from.

I did not go through the appendices thoroughly, but the level of detail w.r.t the hyperparameters and training methods is very encouraging.

📓 Results

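The results table (Table 1 in the paper) shows the results of both models on the ACL WMT ’14 dataset. It’s especially notable that RNNsearch-50 is competitive with Moses (state-of-the-art phrase-based translation system) on data that contains no unknown words, and that the variant trained for longer (marked RNNsearch-50⋆ in the paper) edges past it.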

These results also support the authors’ hypothesis that the flexibility of RNNsearch, which does not have to use a fixed-length context vector, would improve performance on longer sentences. RNNsearch-30 is even able to outperform RNNencdec-50.

The authors do not provide reasoning behind why RNNsearch performs worse on sentences with unknown words; this is non-obvious to me.

The paper’s visualization of the alignments between source and generated translations (figure 3) is helpful for qualitative analysis.

I’m thinking of alignment as “if two closely aligned words or phrases were mapped into a multi-dimensional space based on the model’s knowledge of their characteristics, they would be close to each other”. There isn’t a great definition in the paper and I had trouble finding one online as well; is this a reasonable mental model?

The benefits of soft alignment can be demonstrated in the translation of “the man” to “l’homme” (bottom right of figure 3(d)). Hard alignment would separate “l’ ” and “homme” completely, but in reality the two must be considered together as they carry an important and related signal.
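The difference between the two kinds of alignment can be shown with a toy calculation. The energies below are made-up numbers for the target word “l’”, not values from the paper, and the function names are hypothetical.

```python
import math


def soft_alignment(energies):
    """Softmax: every source word keeps some weight."""
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    return [x / total for x in exps]


def hard_alignment(energies):
    """Argmax: all weight goes to the single best-scoring source word."""
    best = max(range(len(energies)), key=energies.__getitem__)
    return [1.0 if j == best else 0.0 for j in range(len(energies))]


source = ["the", "man"]
energies_for_article = [2.0, 1.0]  # made-up scores while generating "l'"

soft = soft_alignment(energies_for_article)
hard = hard_alignment(energies_for_article)
print(dict(zip(source, [round(w, 2) for w in soft])))  # {'the': 0.73, 'man': 0.27}
print(dict(zip(source, hard)))  # {'the': 1.0, 'man': 0.0}
```

With the soft weights, the context used to generate “l’” still contains some of “man”, which is what the model needs in order to choose between le, la, les and l’.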

The authors also performed a qualitative analysis for their hypothesis regarding long sentences. These examples demonstrate that RNNsearch is better than RNNencdec at retaining context and information from early on in long sentences.

🤠 Conclusion

Based on the quantitative and qualitative tests they ran, the authors conclude that their proposed RNNsearch architecture correctly aligns each target word with the relevant source words, which generates better translations than the basic encoder-decoder and performance comparable to the phrase-based state of the art.

It’s very encouraging that the RNNsearch matched up well with phrase-based translation systems, especially since neural machine translation approaches had only been around for one year at the time of this paper’s writing.

The results are certainly encouraging, but one glaring hole (to me) is that they only tested their architecture on one language pair, while phrase-based models have been proven across many.

The authors believe that the next step is to improve RNNsearch’s performance on unknown tokens, an area in which it still falls well short of phrase-based translation. I agree that this seems like an obvious next step.

I think another next step would be to compute n-gram contexts and use them as features alongside single-word contexts, similar to the way fastText improved its performance with n-gram features.

👂 Questions

What is the significance of “jointly” in the title? h/t Jason M: “There are two separable problems here; word translation (européenne -> european) and alignment (the reversal in “zone économique européenne” -> “european economic zone”). Instead of doing these as two separate models, they build one big neural net that does both jobs, and lets SGD flow through the whole thing so their weights get trained jointly.”

⏩ Viability as a Project

The architecture proposed in this paper is definitely viable for a project. The authors are very thorough with the description of their process, and replicating the steps they took seems quite approachable.

Machine translation work translates (😉) almost directly to real-world applications; if you were able to verifiably beat the current state of the art, your architecture would simply replace the state-of-the-art applications.

My only potential cause for concern would be the size of the data (348M words), but my intuition is that commodity hardware could handle this load.
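A quick back-of-envelope check of that intuition, for storage only. It assumes each word is stored as one 32-bit integer ID, which is my assumption rather than anything stated in the paper, and it says nothing about how long training would take.

```python
WORDS = 348_000_000  # corpus size quoted above
BYTES_PER_TOKEN = 4  # assumption: one 32-bit integer ID per word

corpus_bytes = WORDS * BYTES_PER_TOKEN
print(f"{corpus_bytes / 1e9:.2f} GB")  # 1.39 GB
```

Under that assumption the tokenized corpus fits in memory on an ordinary laptop, so compute time is the more likely constraint.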

🔁 Abstract

The abstract hits all of the key points in the paper concisely and accurately, and it fits my interpretation well.

Fixed-length vector as a bottleneck for encoder-decoder architectures
Soft-search for highest-signal parts of source sentences as a way around the above bottleneck
Qualitative and quantitative analysis included in the results

🗣 What do other researchers say?

This paper has a solid 1800 citations at the time I am reading it, just one short year after its release. This is quite promising!

Denny Britz seems more or less neutral on the paper; he is curious about the effectiveness of more complicated attention functions beyond the simple weighted average employed in the paper.

📚 Other Resources

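“Proceedings of the Empirical Methods in Natural Language Processing” (EMNLP 2014) is where Cho et al. published “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”, the origin of the RNNencdec architecture that this paper so commonly references.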

🤷‍ Words I don’t know

neural machine translation: a new approach to machine translation that operates a neural network at the sentence level rather than the phrase level
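soft-search: ? couldn’t find a technical definition. As far as I can tell from the paper, it refers to the decoder searching the source sentence for the positions most relevant to the next target word, where “soft” means every position gets a weight rather than one position being picked outright. It is not the same thing as beam search (defined below), which the paper uses separately when decoding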
conditional distribution: a probability distribution for a sub-population. In other words, it shows the probability that a randomly selected item in a sub-population has a characteristic you’re interested in
attention mechanism: a method to avoid encoding full source sentences into fixed-length vectors. Rather, it allows the decoder to “attend” to different parts of the source sentence at each step of the output generation
beam search: a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set (best-first)
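To make the beam search definition concrete, here is a minimal sketch over a toy next-word table. The paper uses beam search at decoding time to find an approximately most probable translation, but everything in this example (the table, the tokens, the function names) is hypothetical and only meant to show the mechanics.

```python
import math


def beam_search(next_log_probs, start, end, beam_width, max_len):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, total log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Sequences that emitted the end token stop growing.
            (finished if tokens[-1] == end else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep anything cut off by max_len
    return max(finished, key=lambda c: c[1])


# Toy model: the next-word distribution depends only on the last token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def toy_model(tokens):
    return {tok: math.log(p) for tok, p in TABLE[tokens[-1]].items()}


greedy = beam_search(toy_model, "<s>", "</s>", beam_width=1, max_len=5)
wider = beam_search(toy_model, "<s>", "</s>", beam_width=2, max_len=5)
print(greedy[0])  # ['<s>', 'a', 'x', '</s>']
print(wider[0])  # ['<s>', 'b', 'z', '</s>']
```

With a beam of 1 the search commits to “a” because it looks best locally and ends with probability 0.30. With a beam of 2 it keeps “b” alive long enough to find the sequence with probability 0.36.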