Recent advances in machine translation have enabled innovative use cases such as voice interfaces and early medical diagnosis. From the old days of rule-based phrase-to-phrase translation to today’s neural machine translation, the machine translation problem poses numerous elemental challenges that have brought about brilliant solutions. Here, we will go over the latest developments in neural machine translation at a high level. A great variety of tutorials on these latest publications already exist, and we will point to those that are beginner-friendly.
Fundamentally, a machine translation task is to map a sentence in one language to a sentence in another language. For notation, we will use x for each word in the source language, and y for each word in the target language.
Sequence to Sequence Translation
This is the previous state-of-the-art translation model by Sutskever et al. Jay Alammar has produced a beginner-friendly tutorial on the paper:
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
In a nutshell, a pair of RNNs, one acts as the encoder and the other as the decoder, is trained on sentence pairs in both languages (thus the name “sequence to sequence”). One noticeable innovation in this model is the reversal of source sentences.
For example, if we were to translate from je suis étudiant (French) to I am a student, a straightforward approach would be:
h = RNNencode(je, suis, étudiant)
RNNdecode(h) → [I, am, a, student]
and we train the RNN pair to match as many sentence pairs as possible. The paper’s approach is to reverse the source (French) sentence:
h = RNNencode(étudiant, suis, je)
RNNdecode(h) → [I, am, a, student]
This way, “je” is very close to “I” in terms of RNN time-steps, and “suis” is sufficiently close to “am”. You might argue that “étudiant” is now farther from “student”, but it turned out that putting even a few matching words close to each other helped the model overall by nearly 5 BLEU points. The original authors did not have a complete explanation for this.
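The reversal trick above amounts to a one-line preprocessing step on each training pair. A minimal sketch, assuming whitespace tokenization (the `prepare_pair` helper is hypothetical, for illustration only):

```python
# Sketch of the source-reversal trick from the seq2seq paper:
# source tokens are reversed before being fed to the encoder,
# while the target sentence stays in its original order.
# Whitespace tokenization is an assumption for illustration.

def prepare_pair(source: str, target: str):
    src_tokens = source.split()
    tgt_tokens = target.split()
    return list(reversed(src_tokens)), tgt_tokens

src, tgt = prepare_pair("je suis étudiant", "I am a student")
# src == ["étudiant", "suis", "je"]; tgt == ["I", "am", "a", "student"]
```

With the source reversed, the first words the decoder must produce sit closest to the end of the encoder's input, shortening the path gradients must travel for those words.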
This model has the following imperfections:
- Long sentences: because the hidden state (context) of the RNN has a fixed dimension, long sentences with complex meanings cannot fit inside it.
- Training cost: 10 days on 8 GPUs for WMT English -> German.
- Inference cost: inference is autoregressive, meaning each next word depends on the previously produced word, so words cannot be generated concurrently.
- Dataset: requires paired sentences.
An RNN encodes a sentence into a fixed-length vector. This poses a difficulty for long sentences and is overkill for short ones. In Neural Machine Translation by Jointly Learning to Align and Translate [Bahdanau et al., 2014], the authors achieve near state-of-the-art performance by explicitly letting every other part of the sentence participate in encoding a particular word. This starts to sound like attention. Effective Approaches to Attention-based Neural Machine Translation [Luong et al., 2015] explores different attention-based architectures and achieved a new state of the art. A newer model proposed by [Vaswani et al., 2017] throws away the RNN encoder entirely and achieves better results using a mechanism they coined self-attention, an essential part of the Transformer architecture. Jay Alammar also produced an illustrated tutorial on this paper:
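The core of self-attention is small enough to sketch directly: every word's representation is recomputed as a weighted mixture of all words in the sentence, with the weights given by scaled dot products. A minimal single-head sketch in numpy (the random projections stand in for learned weight matrices):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise word-to-word affinities
    # Softmax over each row, so each word's weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each word becomes a weighted mix of all words

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # a 4-word sentence, d_model = 8
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)
assert out.shape == (4, 8)
```

Because every position attends to every other position in one matrix multiplication, there is no recurrence, which is what lets the Transformer encode the whole sentence in parallel.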
The Illustrated Transformer
The tutorial does a decent job of explaining the mechanism. An additional illustration from the Google AI blog explains self-attention in a nutshell.
The encoder is trained in parallel, and teacher forcing is applied to train the decoder in parallel (not shown in the illustration). The training cost is reduced to 12 hours (on slightly more powerful, newer GPUs), compared to the seq2seq model. Note that the decoder step is still autoregressive at inference time.
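Teacher forcing is simple in essence: during training the decoder reads the ground-truth target shifted right by one position rather than its own previous predictions, so every position's loss can be computed at once. A minimal sketch (the `"<s>"` start token is an assumption for illustration):

```python
# Sketch of teacher forcing: the decoder input is the ground-truth
# target shifted right by one, so all positions are predicted in
# parallel during training instead of one step at a time.

def teacher_forcing_io(target_tokens):
    decoder_input = ["<s>"] + target_tokens[:-1]  # what the decoder reads
    decoder_labels = target_tokens                # what it must predict
    return decoder_input, decoder_labels

inp, labels = teacher_forcing_io(["I", "am", "a", "student"])
# inp == ["<s>", "I", "am", "a"]; labels == ["I", "am", "a", "student"]
```

At inference time no ground truth exists, which is exactly why decoding falls back to the autoregressive, one-word-at-a-time loop mentioned above.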
Transformers are able to speed up training significantly. Gu et al. claim that non-autoregressive decoding models can do almost as well, with a significant speed increase at inference time.
In previous models, inference is time-step based. To produce “I am a student”, the model first produces “I”, then produces “am” given that it has already produced “I”, and so on. In probabilistic terms, each output word is conditioned on the source sentence and all previously produced words.
However, this inherently sequential process makes translation of long sentences slow. What Gu et al. show in the paper is that the next word does not have to depend on the previous word. Instead, the model takes as additional input the expected total length of the output and the current output position.
nextWord = f(meaning, i, n)
where meaning is obtained from the source sentence, i is the index of nextWord, and n is the length of the output sentence.
Continuing with the toy example of Je suis étudiant → I am a student:
Output length n = 4
f(meaning, 1, 4) = I
f(meaning, 2, 4) = am
f(meaning, 3, 4) = a
f(meaning, 4, 4) = student
We can now train f such that it produces the desired output. Notice that each word can now be produced in parallel. In probabilistic terms, this assumes independence among output words, conditioned on the source sentence (meaning), the total output length, and the current index.
How do we know the output length beforehand?
We don’t. However, the model has a “fertility” mechanism to make an educated guess based on the source sentence. This is not yet as accurate as autoregressive models on some tasks.
Note the fertility predictor (red bubble), which dictates how many output tokens are derived from each input token. In this case, “accept” is processed twice, but with a different position encoding each time. This way, we can produce the 3rd and 4th tokens in parallel, knowing beforehand that the word “accept” will have two matching outputs. In autoregressive decoding, by contrast, the second “accept” would depend on the fact that “accept” has already been decoded once in order to determine whether another token is needed to complete its meaning.
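The fertility step can be sketched as a simple expansion of the source sequence: copy each token according to its fertility and tag each copy with its own index. The tokens and fertility values below are hand-picked for illustration, not predicted by a model:

```python
# Sketch of the fertility mechanism: each source token is repeated
# according to its predicted fertility, and each copy carries its own
# copy index (which a real model would turn into a position encoding).
# Summing the fertilities gives the output length up front.

def expand_by_fertility(src_tokens, fertilities):
    decoder_inputs = []
    for token, fert in zip(src_tokens, fertilities):
        for k in range(fert):
            decoder_inputs.append((token, k))  # (token, copy index)
    return decoder_inputs

# "accept" has fertility 2, so it appears twice with different indices.
inputs = expand_by_fertility(["we", "totally", "accept"], [1, 1, 2])
# inputs == [("we", 0), ("totally", 0), ("accept", 0), ("accept", 1)]
# Output length is known in advance: sum of fertilities == 4.
```

Because the expanded sequence fixes both the output length and each position's source token before decoding begins, all output positions can again be filled in parallel.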
Previous work in unsupervised machine translation achieved inferior performance compared to supervised approaches. However, a practically unlimited amount of monolingual text is available digitally. Recent advances in unsupervised machine translation perform as well as supervised systems on some language pairs.
Artetxe et al. have produced results comparable to earlier supervised systems on WMT datasets. The fundamental structure is an auto-encoder network with a shared encoder but a separate decoder for each language.
To ensure the encoder-decoder system captures the meaning of both languages instead of devoting separate dimensions to each language, cross-lingual embeddings are used so that the two vocabularies' embeddings overlap. Thus, the encoder acts as a language-agnostic meaning extractor, and the decoder simply projects the meaning into a particular language. An implementation is provided by the original authors at https://github.com/artetxem/monoses.
A common training technique in unsupervised machine translation is cycle-consistency, which means English → German → English should reproduce the same English text. This technique is inspired by unsupervised image transformation models such as [CycleGAN]. It is evident in both the model above and the one below.
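Cycle-consistency can be illustrated with a toy round trip: translate to the other language and back, and penalize any mismatch with the original. The word-for-word dictionaries below are a stand-in for real translation models, used only to make the loss concrete:

```python
# Toy illustration of cycle-consistency: English -> German -> English
# should reproduce the original text, and the mismatch can serve as a
# training signal. Real systems use learned models, not dictionaries.

EN_TO_DE = {"I": "ich", "am": "bin", "a": "ein", "student": "Student"}
DE_TO_EN = {v: k for k, v in EN_TO_DE.items()}

def translate(tokens, table):
    return [table[t] for t in tokens]

def cycle_loss(tokens):
    round_trip = translate(translate(tokens, EN_TO_DE), DE_TO_EN)
    # Fraction of words that changed after the round trip.
    return sum(a != b for a, b in zip(tokens, round_trip)) / len(tokens)

assert cycle_loss(["I", "am", "a", "student"]) == 0.0
```

In training, minimizing this round-trip mismatch gives the model a supervision signal even when no parallel sentence pairs exist, since only monolingual text is needed on each side.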
Subsequent research has been done by Lample et al. at Facebook. The fundamental assumption is that even though each language has a different vocabulary, they all describe the same things, “because people across the world share the same physical world”. Thus, words in one vocabulary, if embedded properly, should project onto another language’s embeddings in some high-dimensional space, and a translation is simply a nearest-neighbor search on each word or phrase. On top of that, post-processing and self-language training are similarly performed to fine-tune performance.
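The nearest-neighbor view of word translation can be sketched directly: once both vocabularies live in a shared embedding space, translating a word means finding the closest target-language embedding by cosine similarity. The 2-D embeddings below are hand-made for illustration, not learned:

```python
import numpy as np

# Toy sketch of word translation as nearest-neighbor search in a
# shared cross-lingual embedding space. Real systems learn these
# embeddings from monolingual corpora; ours are hand-picked.

src_emb = {"chat": np.array([1.0, 0.1]), "chien": np.array([0.1, 1.0])}
tgt_emb = {"cat": np.array([0.9, 0.2]), "dog": np.array([0.2, 0.9])}

def nearest_neighbor(word):
    v = src_emb[word]
    def cosine(u):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    # Pick the target word whose embedding is most similar.
    return max(tgt_emb, key=lambda w: cosine(tgt_emb[w]))

assert nearest_neighbor("chat") == "cat"
assert nearest_neighbor("chien") == "dog"
```

A per-word lookup like this is only a starting point; as the text notes, post-processing and further training are what turn these word-level matches into fluent sentence-level translations.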