# History and Frontiers of Neural Machine Translation

Machine translation (MT) uses machines to perform “automatic translation of text from one natural language (the source language) to another (the target language)” [1]. The idea of translating by machine was first raised by Warren Weaver in 1949. For a long time (the 1950s through the 1980s), machine translation relied on studying linguistic information about the source and target languages and generating translations from dictionaries and grammars; this approach is called rule-based machine translation (RBMT). As statistical methods matured, statistical models began to be applied to machine translation, generating translations from the analysis of bilingual text corpora. This approach, known as statistical machine translation (SMT), outperformed RBMT and dominated the field from the 1980s to the 2000s. In 1997, Ramon Neco and Mikel Forcada proposed using an “encoder-decoder” structure for machine translation [2]. A few years later, in 2003, a group of researchers at the University of Montreal led by Yoshua Bengio developed a language model based on neural networks [3], which alleviated the data-sparsity problem of traditional SMT models. Their work laid a foundation for the future use of neural networks in machine translation.

# The Birth of Neural Machine Translation

In 2013, Nal Kalchbrenner and Phil Blunsom proposed a new end-to-end encoder-decoder structure for machine translation [4]. This model encodes a given source text into a continuous vector using a Convolutional Neural Network (CNN), then uses a Recurrent Neural Network (RNN) as the decoder to transform the state vector into the target language. Their work can be regarded as the birth of Neural Machine Translation (NMT), a method that uses deep neural networks to map between natural languages. NMT’s nonlinear mapping differs from the linear SMT models, and it describes semantic equivalence through the state vectors that connect the encoder and decoder. In addition, the RNN was expected to capture information from sentences of arbitrary length and to solve the problem of “long-distance reordering” [29]. However, the “exploding/vanishing gradient” problem [28] makes it hard for an RNN to actually handle long-distance dependencies; as a result, the early NMT models did not perform well.

# Memory for the Long Distance

One year later, in 2014, Sutskever et al. and Cho et al. developed a method called sequence-to-sequence (seq2seq) learning that uses RNNs for both the encoder and the decoder [5][6], and introduced the Long Short-Term Memory (LSTM, a variant of RNN) to NMT. Thanks to the gate mechanism, which allows explicit memory deletion and updating in the LSTM, the “exploding/vanishing gradient” problem is controlled, so the model can capture “long-distance dependencies” in a sentence much better. The introduction of the LSTM largely solved the “long-distance reordering” problem, while shifting the primary challenge of NMT to the “fixed-length vector” problem: as shown in Figure 1, no matter how long or short the source sentence is, the neural network must compress it into a fixed-length vector, which leads to increasing complexity and uncertainty during decoding, especially when the source sentence is long [6].

*Figure 1: Mechanism of the original neural machine translation without “attention” [5]*
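The fixed-length bottleneck can be made concrete with a toy, untrained NumPy sketch. This is not Sutskever et al.’s actual architecture (they use deep LSTMs with learned parameters); the random matrices, sizes, and function names below are illustrative assumptions that only demonstrate the structural point — every source sentence, however long, is squeezed into one vector of the same size.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8          # size of the fixed-length state vector
vocab = 10          # toy vocabulary size

# Toy parameters (random here; a real model learns these)
E = rng.normal(size=(vocab, hidden))        # embedding table
W_enc = rng.normal(size=(hidden, hidden))   # encoder recurrence
W_dec = rng.normal(size=(hidden, hidden))   # decoder recurrence
W_out = rng.normal(size=(hidden, vocab))    # output projection

def encode(source_ids):
    """Compress a source sentence of ANY length into one fixed-length vector."""
    h = np.zeros(hidden)
    for tok in source_ids:
        h = np.tanh(E[tok] + W_enc @ h)
    return h  # the single vector the decoder must work from

def decode(h, steps):
    """Greedily emit target tokens from the fixed-length state alone."""
    out = []
    for _ in range(steps):
        h = np.tanh(W_dec @ h)
        logits = W_out.T @ h
        out.append(int(np.argmax(logits)))
    return out

short = encode([1, 2])
long_ = encode([1, 2, 3, 4, 5, 6, 7, 8, 9])
assert short.shape == long_.shape == (hidden,)  # same bottleneck either way
```

However different the two inputs are in length, `encode` returns the same 8-dimensional vector, which is exactly why decoding degrades on long sentences.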

# Attention, Attention, Attention

The “fixed-length vector” problem began to be addressed in 2014, when Yoshua Bengio’s group introduced the “attention” mechanism to NMT [7]. The attention mechanism was originally proposed by DeepMind for image classification [23]; it “enables the neural network to focus on relevant parts of input more than irrelevant parts when doing a prediction task” [24]. When the decoder generates a word of the target sentence, only a small portion of the source sentence is relevant; thus a content-based attention mechanism is applied to dynamically generate a weighted context vector from the source sentence (as shown in Figure 2, the transparency of the purple lines indicates the weights). The target word is then predicted from the context vectors instead of a single fixed-length vector. The performance of NMT improved dramatically, and “attentional encoder-decoder networks” became the state-of-the-art model in the field of NMT.

*Figure 2: Mechanism of the “attentional encoder-decoder networks” architecture from Google Neural Machine Translation (GNMT) [8]*
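The weighted context vector can be illustrated with a minimal NumPy sketch. Note one simplifying assumption: relevance is scored here with a plain dot product, whereas Bahdanau et al. [7] score it with a small feed-forward network; the function and variable names are mine, not from any of the cited systems.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(encoder_states, decoder_state):
    """Build a context vector as an attention-weighted sum of source states.

    encoder_states: (src_len, d) array, one vector per source word
    decoder_state:  (d,) array, the decoder's current hidden state
    """
    scores = encoder_states @ decoder_state   # relevance of each source word
    weights = softmax(scores)                 # normalize into a distribution
    context = weights @ encoder_states        # weighted sum = context vector
    return context, weights

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 4))    # 5 source words, 4-dim encoder states
s = rng.normal(size=4)         # current decoder state
context, weights = attention_context(H, s)
assert np.isclose(weights.sum(), 1.0)  # weights form a distribution
```

The `weights` array is what Figure 2 visualizes with line transparency: a fresh distribution over source words is computed for every target word, so no single fixed-length vector has to carry the whole sentence.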

# NMT vs. SMT

Although NMT had achieved remarkable results in particular translation experiments, researchers wondered whether the good performance would persist on other tasks, and whether NMT could indeed replace SMT. Accordingly, Junczys-Dowmunt et al. ran experiments on the “United Nations Parallel Corpus”, which covers 15 language pairs and 30 translation directions; measured by BLEU score (“a method for automatic evaluation of machine translation”; the higher, the better [33]), NMT was either on par with or surpassed SMT across all 30 translation directions [30]. Moreover, at the Workshop on Statistical Machine Translation (WMT) competition in 2015, the team from the University of Montreal used NMT to win first place in English-German translation and third place in German-English, Czech-English, and English-Czech translation [31]. Compared with SMT, NMT can train multiple features jointly and needs no prior domain knowledge, which enables zero-shot translation [32]. In addition to higher BLEU scores and better sentence structure, NMT also helps reduce the morphology, syntax, and word-order errors commonly seen with SMT. On the other hand, NMT still has problems and challenges to be tackled: training and decoding are quite slow; the translation style can be inconsistent for the same word; an “out-of-vocabulary” problem appears in the translation results; the “black-box” neural network mechanism leads to poor interpretability; and the training parameters are therefore mostly chosen based on experience.
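To give a rough idea of how BLEU works, here is a simplified sentence-level sketch assuming a single reference and no smoothing; the official metric aggregates clipped n-gram counts over a whole corpus, so this toy version is only meant to show the two ingredients: n-gram precision and the brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU against one reference (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())           # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # avoid log(0)
    # Geometric mean of the n-gram precisions...
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # ...times a brevity penalty for candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
assert bleu(ref, ref) == 1.0  # a perfect match scores 1
```

A candidate that matches the reference exactly scores 1.0, while shorter or partially matching candidates are pushed toward 0 by the missing n-grams and the brevity penalty — which is why “the higher, the better” holds.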

# The Arms Race is on

Because of these characteristics and its superiority over SMT, NMT has recently been adopted by industry as well. In September 2016, the **Google** Brain team published a blog post announcing that they had started using NMT in place of Phrase-Based Machine Translation (PBMT, a variant of SMT) for Chinese-English translation in their product, Google Translate [8]. The NMT system they deployed is named Google Neural Machine Translation (GNMT), and a paper was published at the same time explaining the model in detail [9]. Within a year (2017), **Facebook** AI Research (FAIR) announced its way of implementing NMT with CNNs, which achieves performance similar to RNN-based NMT [10][11] while running nine times faster. In response, Google released a solely attention-based NMT model in June, which uses neither CNNs nor RNNs and is purely based on the “attention” mechanism [12]. Other tech giants followed: **Amazon** released its NMT implementation with MXNet in July [13]; **Microsoft** talked about its use of NMT in 2016, though it has not yet revealed further technical details [27]. **IBM** Watson (a veteran in machine translation), **NVIDIA** (the leader in AI computing), and **SYSTRAN** (a pioneer in machine translation) [35] have all taken part in the development of NMT to some degree. In China, a rising star in the field of AI, even more companies, including **Baidu**, **NetEase-Youdao**, **Tencent**, **Sogou**, **iFlytek**, and **Alibaba**, have already deployed NMT. All of them are trying their best to gain a competitive advantage in the next round of the machine-translation evolution.

# Is NMT the Future?

The NMT technique is developing rapidly in this fast-paced and highly competitive environment. At the latest ACL 2017 conference, all 15 papers accepted under the machine-translation category are about neural machine translation [34]. Improvements to NMT continue to be made in various aspects, including:

- Rare word problem [14] [15]
- Monolingual data usage [16] [17]
- Multiple language translation/multilingual NMT [18]
- Memory mechanism [19]
- Linguistic integration [20]
- Coverage problem [21]
- Training process [22]
- Priori knowledge integration [25]
- Multimodal translations [26]

Therefore, we have every reason to believe that NMT will achieve greater breakthroughs, gradually become the mainstream machine-translation technique in place of SMT, and benefit the whole of society in the near future.

# One More Thing

To help you experience the magic of NMT, we have listed some open-source implementations of NMT built with different tools, so you can learn by doing:

- **TensorFlow** [Google-GNMT]: https://github.com/tensorflow/nmt
- **Torch** [Facebook-fairseq]: https://github.com/facebookresearch/fairseq
- **MXNet** [Amazon-Sockeye]: https://github.com/awslabs/sockeye
- **Theano** [NEMATUS]: https://github.com/EdinburghNLP/nematus
- **Theano** [THUMT]: https://github.com/thumt/THUMT
- **Torch** [OpenNMT]: https://github.com/opennmt/opennmt
- **PyTorch** [OpenNMT]: https://github.com/OpenNMT/OpenNMT-py
- **Matlab** [StanfordNMT]: https://nlp.stanford.edu/projects/nmt/
- **DyNet-lamtram** [CMU]: https://github.com/neubig/nmt-tips
- **EUREKA** [MangoNMT]: https://github.com/jiajunzhangnlp/EUREKA-MangoNMT

If you are interested in learning more about NMT, we encourage you to read the papers listed in the reference section: [5][6][7] are must-read core papers for understanding NMT, and [9] is a comprehensive demonstration of NMT’s mechanism and implementation. We also cover machine translation as one section of our ongoing AI Tech Report. A sneak peek of the report is available here, and you may sign up to receive the full report once it is released.

# References

[1] Russell, S. & Norvig, P. (1995). Artificial intelligence: a modern approach.
[2] Neco, R. P., & Forcada, M. L. (1997, June). Asynchronous translations with recurrent neural nets. In *Neural Networks, 1997., International Conference on* (Vol. 4, pp. 2535–2540). IEEE.
[3] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. *Journal of machine learning research*, *3*(Feb), 1137–1155.
[4] Kalchbrenner, N., & Blunsom, P. (2013, October). Recurrent Continuous Translation Models. In *EMNLP* (Vol. 3, №39, p. 413).
[5] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In *Advances in neural information processing systems*(pp. 3104–3112).
[6] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. *arXiv preprint arXiv:1406.1078*.
[7] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. *arXiv preprint arXiv:1409.0473*.
[8] *A Neural Network for Machine Translation, at Production Scale*. (2017). *Research Blog*. Retrieved 26 July 2017, from https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
[9] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … & Klingner, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. *arXiv preprint arXiv:1609.08144*.
[10] Gehring, J., Auli, M., Grangier, D., & Dauphin, Y. N. (2016). A convolutional encoder model for neural machine translation. *arXiv preprint arXiv:1611.02344*.
[11] Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. *arXiv preprint arXiv:1705.03122*.
[12] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. *arXiv preprint arXiv:1706.03762*.
[13] *Train Neural Machine Translation Models with Sockeye | Amazon Web Services*. (2017). *Amazon Web Services*. Retrieved 26 July 2017, from https://aws.amazon.com/blogs/ai/train-neural-machine-translation-models-with-sockeye/
[14] Jean, S., Cho, K., Memisevic, R., & Bengio, Y. (2014). On using very large target vocabulary for neural machine translation. *arXiv preprint arXiv:1412.2007*.
[15] Luong, M. T., Sutskever, I., Le, Q. V., Vinyals, O., & Zaremba, W. (2014). Addressing the rare word problem in neural machine translation. *arXiv preprint arXiv:1410.8206*.
[16] Sennrich, R., Haddow, B., & Birch, A. (2015). Improving neural machine translation models with monolingual data. *arXiv preprint arXiv:1511.06709*.
[17] Cheng, Y., Xu, W., He, Z., He, W., Wu, H., Sun, M., & Liu, Y. (2016). Semi-supervised learning for neural machine translation. *arXiv preprint arXiv:1606.04596*.
[18] Dong, D., Wu, H., He, W., Yu, D., & Wang, H. (2015). Multi-Task Learning for Multiple Language Translation. In *ACL (1)* (pp. 1723–1732).
[19] Wang, M., Lu, Z., Li, H., & Liu, Q. (2016). Memory-enhanced decoder for neural machine translation. *arXiv preprint arXiv:1606.02003*.
[20] Sennrich, R., & Haddow, B. (2016). Linguistic input features improve neural machine translation. *arXiv preprint arXiv:1606.02892*.
[21] Tu, Z., Lu, Z., Liu, Y., Liu, X., & Li, H. (2016). Modeling coverage for neural machine translation. *arXiv preprint arXiv:1601.04811*.
[22] Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., & Liu, Y. (2015). Minimum risk training for neural machine translation. *arXiv preprint arXiv:1512.02433*.
[23] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In *Advances in neural information processing systems* (pp. 2204–2212).
[24] Dandekar, N. (2017). *How does an attention mechanism work in deep learning for natural language processing?*. *Quora*. Retrieved 26 July 2017, from https://www.quora.com/How-does-an-attention-mechanism-work-in-deep-learning-for-natural-language-processing
[25] Cohn, T., Hoang, C. D. V., Vymolova, E., Yao, K., Dyer, C., & Haffari, G. (2016). Incorporating structural alignment biases into an attentional neural translation model. *arXiv preprint arXiv:1601.01085*.
[26] Hitschler, J., Schamoni, S., & Riezler, S. (2016). Multimodal pivots for image caption translation. *arXiv preprint arXiv:1601.03916*.
[27] *Microsoft Translator launching Neural Network based translations for all its speech languages*. (2017). *Translator*. Retrieved 27 July 2017, from https://blogs.msdn.microsoft.com/translation/2016/11/15/microsoft-translator-launching-neural-network-based-translations-for-all-its-speech-languages/
[28] Pascanu, R., Mikolov, T., & Bengio, Y. (2013, February). On the difficulty of training recurrent neural networks. In *International Conference on Machine Learning* (pp. 1310–1318).
[29] Sudoh, K., Duh, K., Tsukada, H., Hirao, T., & Nagata, M. (2010, July). Divide and translate: improving long distance reordering in statistical machine translation. In *Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR* (pp. 418–427). Association for Computational Linguistics.
[30] Junczys-Dowmunt, M., Dwojak, T., & Hoang, H. (2016). Is neural machine translation ready for deployment? A case study on 30 translation directions. *arXiv preprint arXiv:1610.01108*.
[31] Bojar, O., Chatterjee, R., Federmann, C., et al. (2015). Findings of the 2015 Workshop on Statistical Machine Translation. In *Proceedings of the Tenth Workshop on Statistical Machine Translation*.
[32] Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., … & Hughes, M. (2016). Google’s multilingual neural machine translation system: enabling zero-shot translation. *arXiv preprint arXiv:1611.04558*.
[33] Bartolome, D., & Ramirez, G. (2016, May 23). Beyond the Hype of Neural Machine Translation. *MIT Technology Review*. bit.ly/2aG4bvR.
[34] ACL 2017. (2017). *Accepted Papers, Demonstrations and TACL Articles for ACL 2017*. [online] Available at: https://chairs-blog.acl2017.org/2017/04/05/accepted-papers-and-demonstrations/ [Accessed 7 Aug. 2017].
[35] Crego, J., Kim, J., Klein, G., Rebollo, A., Yang, K., Senellart, J., … & Enoue, S. (2016). SYSTRAN’s Pure Neural Machine Translation Systems. *arXiv preprint arXiv:1610.05540*.

Analyst: Mos Zhang | Technical Report and Analysis produced by Synced Lab