Neural Machine Translation Today

A brief tour of important concepts to NMT today

Jerry Liu
Lion IQ
7 min read · Oct 7, 2017


In 2016, Google Translate deployed Neural Machine Translation (NMT) models into production and, suddenly, deep learning and the “AI Awakening” were announced to the world. Having launched in 2006, Google Translate had by then been serving internet users across the world for ten years. The switch also marked the departure from phrase-based Statistical Machine Translation (SMT) models to Neural Machine Translation models.

Google Translate now uses NMT for all its translations

First introduced by Kalchbrenner and Blunsom (2013), Neural Machine Translation models started to be competitive at academic machine translation conferences in 2015, and by 2016 had surpassed SMT in almost all language translation tasks.

While the achievements of NMT are numerous and rapid, we will highlight a few papers and concepts that have defined NMT today. We will also touch on a couple of recent papers to try to infer where NMT is headed.

Sequence to sequence learning

The seminal paper by Sutskever et al (2014) introduced sequence to sequence learning (seq2seq), a method that uses Recurrent Neural Networks for both the encoder and the decoder. To address the vanishing gradients that plague deep RNN models, they use Long Short-Term Memory (LSTM) networks. The paper also notes some important observations (a minimal sketch follows below):

  • seq2seq has no trouble with long sentences, but performs better when the source sentence is fed into the network in reverse.
  • seq2seq learns sensible phrase and sentence representations that are sensitive to word order.
sequence to sequence learning, using LSTMs
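
To make the encoder-decoder idea concrete, here is a minimal PyTorch sketch (not the authors' implementation; vocabulary sizes, dimensions, and token ids are made up). The encoder compresses the reversed source sentence into its final LSTM state, which is all the decoder gets to condition on:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder with LSTMs, in the spirit of Sutskever et al."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Sutskever et al. feed the source sentence in reverse order.
        src = self.src_embed(torch.flip(src_tokens, dims=[1]))
        _, (h, c) = self.encoder(src)          # keep only the final encoder state
        # The decoder is conditioned solely on that final state.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), (h, c))
        return self.out(dec_out)               # logits over the target vocabulary

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 12))   # (batch, src_len), dummy token ids
tgt = torch.randint(0, 8000, (2, 10))   # (batch, tgt_len)
logits = model(src, tgt)                # (2, 10, 8000)
```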

Attention Mechanisms

In LSTM seq2seq models, the decoder generates the translation based only on the last hidden state of the encoder. Empirically, translation quality degrades on longer sentences.

  • Models can improve by feeding source sentences in reverse.
  • Models can improve by feeding an input sequence twice.

The seminal work of Bahdanau et al (2014) introduced attention mechanisms to seq2seq models. Attention mechanisms allow connections between every encoder hidden state and the decoder's hidden states, so that each target word is predicted from a context vector (a weighted combination of encoder states), rather than just the previous hidden state of the decoder. Intuitively, this makes sense: only a small portion of the source sentence is relevant to translating any given target word.
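
As an illustration (a sketch only, not Bahdanau et al.'s exact formulation; the dimension is arbitrary), additive attention scores every encoder hidden state against the current decoder state, normalizes the scores with a softmax, and returns the weighted sum as the context vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style (additive) attention; dimensions are illustrative."""
    def __init__(self, dim=256):
        super().__init__()
        self.W_enc = nn.Linear(dim, dim, bias=False)
        self.W_dec = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, dim); encoder_states: (batch, src_len, dim)
        scores = self.v(torch.tanh(
            self.W_enc(encoder_states) + self.W_dec(decoder_state).unsqueeze(1)
        )).squeeze(-1)                           # (batch, src_len)
        weights = F.softmax(scores, dim=-1)      # relevance of each source position
        context = (weights.unsqueeze(-1) * encoder_states).sum(dim=1)
        return context, weights                  # context feeds the next prediction
```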

In 2016, attention-based NMT models achieved top results in almost all translation tasks. Attention mechanisms have been so successful that Chris Manning once opined that you could throw bi-LSTMs at any task in NLP (and add attention if you need information flow).

seq2seq with attention. Purple lines demonstrate attention mechanism. image from https://research.googleblog.com/2016/09/a-neural-network-for-machine.html

There is an additional side bonus to attention mechanisms: they can be used to visualize the contributing weight of each source word or subword to each target word, making attention models more interpretable.

Bahdanau et al (2014)
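
Such a visualization can be produced by plotting the matrix of attention weights (one row per target token, one column per source token) as a heatmap. The snippet below is a sketch with random weights and made-up tokens, purely to show the shape of the plot:

```python
import matplotlib.pyplot as plt
import torch

# Hypothetical attention weights: each row sums to 1 after the softmax.
src_tokens = ["the", "cat", "sat", "</s>"]
tgt_tokens = ["le", "chat", "s'est", "assis", "</s>"]
weights = torch.softmax(torch.randn(len(tgt_tokens), len(src_tokens)), dim=-1)

plt.imshow(weights.numpy(), cmap="gray")
plt.xticks(range(len(src_tokens)), src_tokens)
plt.yticks(range(len(tgt_tokens)), tgt_tokens)
plt.xlabel("source")
plt.ylabel("target")
plt.show()
```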

Subword Encoding

NMT models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Rare or unknown words (or worse, spelling errors) are a problem for NMT because it relies on word embeddings; unlike SMT models, NMT models don't have an array of grammar and language-specific components to fall back on.

Previously, sentences were fed into NMT models as a series of word tokens, and a variety of methods were studied to address this problem:

  • Increase the size of the source and target vocabularies. However, this increases training time and decoding complexity.
  • Tokenize at the character level. This makes the vocabulary very small but increases sequence length, again increasing training time and decoding complexity.
  • Handle rare or unknown words with other methods, such as dictionary lookups or part-of-speech taggers that recognize words or phrases as proper nouns.

Sennrich et al (2016) introduced subword units and argued that word tokens are not optimal. To build a vocabulary of subwords, the paper adapts Byte-Pair Encoding (BPE), a compression algorithm, to word segmentation (a toy sketch of the merge loop follows the list below).

  • BPE is more efficient than storing all known words. It is, after all, a compression algorithm.
  • At worst, a word can be encoded as a sequence of characters, avoiding complicated dictionary fallbacks. As an open-vocabulary strategy, this eliminates out-of-vocabulary words.
  • While the paper is framed as a strategy for dealing with rare words, subword encoding outperforms full-word encoding across the board, including language pairs that don't share an alphabet, such as English to Chinese.
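
Here is a toy sketch of the BPE merge loop, following the worked example in the Sennrich et al. paper: words are split into characters plus an end-of-word marker, adjacent symbol pairs are counted, and the most frequent pair is repeatedly merged into a new symbol.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace the chosen pair with a single merged symbol in every word."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters plus an end-of-word marker,
# weighted by frequency.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):                     # number of merges = subword budget
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)    # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)
```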

Convolutional Sequence to Sequence learning

The dominant approach to NMT thus far encodes source sentences with a stack of bi-directional LSTMs and generates a variable-length output with another set of decoder RNNs, both connected by attention mechanisms. Convolutional neural networks, applied with great success in image recognition, have several advantages, chiefly that they parallelize well on GPUs; RNNs, in comparison, maintain a hidden state that depends on the previous hidden state, so computation is inherently sequential.

Facebook AI's Gehring et al (2017) present a fully convolutional architecture, which creates hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers. Although convolution filters have fixed-width kernels, stacking convolution layers eventually connects all the words or subwords in a sentence, capturing long-range dependencies.
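
A back-of-the-envelope sketch of why stacking works: with kernel width k, each additional convolution layer widens the receptive field by k - 1 positions, so a modest stack already connects fairly distant tokens (the numbers below are illustrative, not the paper's configuration):

```python
def receptive_field(num_layers, kernel_width):
    # Each stacked 1-D convolution with kernel width k adds (k - 1)
    # more input positions to the receptive field.
    return num_layers * (kernel_width - 1) + 1

for layers in (1, 2, 4, 8):
    print(layers, "layers ->", receptive_field(layers, kernel_width=5), "tokens")
# 1 layers -> 5 tokens ... 8 layers -> 33 tokens
```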

For the model to be competitive, the authors also devised additional architectural components:

  • Positional embeddings added to the input embeddings, to capture a sense of sequential order (see the sketch after this list).
  • Multi-step attention, where attention is computed from the current decoder state and the embedding of the previous target token.
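
Here is a minimal sketch of the positional-embedding idea (sizes and tensors are made up, and this is not Fairseq's actual code): a learned embedding of the position index is simply added to the token embedding before the convolutions see it.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
vocab_size, max_len, dim = 10000, 256, 512

tok_embed = nn.Embedding(vocab_size, dim)   # word/subword embeddings
pos_embed = nn.Embedding(max_len, dim)      # learned positional embeddings

tokens = torch.randint(0, vocab_size, (1, 20))          # (batch, seq_len)
positions = torch.arange(tokens.size(1)).unsqueeze(0)   # positions 0..seq_len-1

# ConvS2S-style input: token embedding plus positional embedding, giving the
# otherwise order-agnostic convolutions a sense of sequence order.
x = tok_embed(tokens) + pos_embed(positions)
```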

While previous convolutional methods were unable to unseat the LSTM-attention hegemony, ConvS2S achieves state-of-the-art results, surpassing previous bi-LSTM attention models while reducing training time by an order of magnitude.

This achievement raises the question: if sequence to sequence learning can be done without RNNs, do attention mechanisms contribute much more to NMT than we think?

Attention is all you need

Google Brain's Vaswani et al (2017) propose a new architecture, the Transformer, based solely on attention mechanisms. Instead of recurrent or convolutional layers, the Transformer introduces “self-attention” layers in both the encoder and the decoder.

Self-attention layers connect all subword embeddings in the sentence, effectively allowing the network to learn inter-dependencies between any two positions in a constant number of operations. This addresses the long-range dependency problem that has plagued RNN architectures; while ConvS2S tackles it with a stack of convolution layers, the path length between distant positions is still O(n/k) for kernel width k.

To facilitate self-attention layers, the paper introduces multi-head attention. Rather than learning a single attention function, multi-head attention learns 8 different attention functions in parallel and concatenates their outputs to form the context vector. Attention is a relatively simple calculation compared to recurrence or convolution, and multi-head attention can be parallelized efficiently. Intuitively, self-attention learns positional interdependencies between individual subwords in the source and target sentences. As a side bonus, this makes self-attention more interpretable than recurrent or convolutional layers.

Vaswani et al (2017)
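
For a sense of the core computation, here is a simplified sketch of scaled dot-product self-attention with multiple heads (sizes are illustrative, and the input is assumed to be already projected into per-head queries, keys, and values):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)   # every position attends to every other
    return weights @ v

batch, heads, seq_len, head_dim = 2, 8, 10, 64   # 8 heads, as in the paper
x = torch.randn(batch, heads, seq_len, head_dim) # pretend: projected inputs

# Self-attention: queries, keys, and values all come from the same sequence,
# so any two positions are connected in a single step.
out = scaled_dot_product_attention(x, x, x)      # (2, 8, 10, 64)
# Multi-head attention concatenates the heads back into one vector per position.
out = out.transpose(1, 2).reshape(batch, seq_len, heads * head_dim)
```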

With the Transformer model setting a new benchmark score, this marks the second time in 2017 that the bar was raised by a non-recurrent model. Even more impressively, the Transformer displays remarkable efficiency, with its base model faster to train than even ConvS2S by orders of magnitude.

I think this marks an exciting time in Neural Machine Translation, as we start to see more efficient representations for translation learning. It's not clear yet whether the new architectures have ended the dominance of LSTM attention models, but a new class of models that are faster to train would make NMT experimentation more accessible and repeatable.

We’re not there yet, but we’re making big strides

Frameworks

With a multitude of open-source software available, NMT has never been more accessible to amateurs and startups alike.

  • Tensorflow-NMT is a great place to start for beginners, and has a fantastic tutorial to explain many NMT concepts in practical code.
  • Nematus is a fully featured NMT framework that the University of Edinburgh team has used to consistently place top results at WMT.
  • Fairseq is a seq2seq and convS2S framework from Facebook AI based on Torch. I personally find it to be very fast.
  • Tensor2Tensor is a fantastic framework used by the Transformer paper authors. It has a very comprehensive end-to-end approach to NMT experiments.
