How good is Neural Machine Translation?

Jerry Liu
Lion IQ
Oct 4, 2017

Deep Learning drew a lot of attention with Google Translate’s successful deployment in a consumer product back in 2016. (Feels like it was so long ago!) So, how good is Machine Translation today, and is it “solved”?

(This blog post is an extended adaptation of a talk I gave at Shanghai Machine Learning Meet-up in Sep 2017)

Is Machine Translation solved?

Translation seems like a task very well suited to AI. The idea of a device or intelligent robot that can translate and converse in alien languages has graced much popular science fiction. Today, the consumer consensus is to turn on Google Translate (or Baidu / Youdao / Sogou if you’re in China) and accept that the results get you most of the way there, with the odd “lost-in-translation” sentence here and there.

We might get self-driving cars before we solve AI translation?

Many if not all of my expat friends in China have trouble using Baidu Maps, despite some having lived in China for a number of years. This is a strange phenomenon in 2017; even when the VPN app works, Google Maps still has trouble finding some locations and often has outdated navigation.

So I built Map Captain, a mobile app that translates English/Chinese search queries and directions on top of Baidu Maps. A map app used by locals in a foreign country is always superior to a tourist map resource; we just need to remove the language barrier.

China maps in English. http://mapcaptain.io

Along the way, I wanted to know how good state-of-the-art machine translation was, and in particular, English-Chinese translation.

A brief history of Machine Translation

Machine translation (MT) has a rich history, and it starts as all machine learning domains do: with a parallel corpus. The Rosetta Stone is a famous parallel corpus carved on a stone slab; it allowed scholars to decipher Ancient Egyptian, which at the time was an unknown language. Today, MT systems have vastly larger datasets thanks to tapping into the web. The history of Statistical Machine Translation is decorated with many achievements, but that is out of scope for this article.

Neural Machine Translation, the application of deep neural networks to machine translation, hit the scene only a few years ago, but has since made big sweeps across many translation tasks.

Is Neural Machine Translation the New State of the Art?

In the study “Is Neural Machine Translation the New State of the Art?” (2017), the authors compare Neural Machine Translation (NMT) against Statistical Machine Translation (SMT) across three domains:

  • E-commerce Product Listing (En-De)
  • Patent Titles and Abstracts (Zh-En)
  • MOOC content (En-De, En-El, En-Pt, En-Ru)

In their abstract, they note: “Automatic Evaluation results presented for NMT are very promising, however human evaluations show mixed results”. This is in line with typical consumer experience of tools such as Google Translate.
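The automatic evaluation in such studies usually means BLEU, which scores n-gram overlap between a system output and a human reference, with a brevity penalty for short outputs. Below is a minimal sketch of a sentence-level variant; the single reference and add-one smoothing are simplifications of my own, and real evaluations use corpus-level BLEU with standard tooling:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions times a brevity penalty (single reference)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero the score
        precisions.append((overlap + 1) / (total + 1))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, while a fluent-but-short or reworded output is penalized, which is exactly why BLEU can look good while human evaluators still find the output wrong.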

The Patent Titles and Abstracts study is particularly interesting, as it is a collection of legal documents written in Chinese. Firstly, most academic NMT research compares results on Western language pairs such as English-German or English-French; Chinese and English share no alphabet or common word stems, unlike Western languages with shared Latin origins. Secondly, in a collection of Chinese legal documents written in terse technical jargon, we can expect a large vocabulary of words that a model has rarely seen in its training data. The NMT system in this case fared poorly compared to the SMT system, which may use hand-crafted features and dedicated language models.

In summary, NMT wins in the MOOC automatic evaluation, does poorly when it sees words/phrases it doesn’t know, and is bad at Chinese.

Is Neural Machine Translation ready for deployment?

In this 2016 study, the authors compare the translation quality of phrase-based SMT and NMT across a wide range of language pairs. The study is conducted on the United Nations Parallel Corpus, which is available in six languages.

With the exception of French-Spanish, NMT outperforms SMT across the board. Particularly impressive are the translation results involving Chinese. It should be noted, however, that these are the worst-performing SMT results, suggesting the SMT baselines in these experiments had some margin for improvement.

https://arxiv.org/abs/1610.01108

A second chart shows translation results for language pairs involving English. Hiero, a hierarchical phrase-based SMT system known to outperform phrase-based SMT in Chinese-English translation, is added to the experiment. In all cases NMT outperforms both SMT systems, beating even Hiero by a significant margin.

https://arxiv.org/abs/1610.01108

In summary, the paper conducted a study on an open six-way dataset, compared 30 language pairs, and found NMT to outperform SMT in 29 of them, with particularly large gains on Chinese.

This contrasts with the results in the section above, suggesting that seasoned SMT systems with rich, domain-adapted feature engineering still set a high bar for NMT to catch up to.

Where are we now?

NMT systems achieved top results on most language pairs at WMT’16, and it is no coincidence that Google, Baidu, and other commercial NMT deployments also launched in 2016.

In a study of human evaluation of NMT, Neubig et al. find that NMT results are more fluent and more grammatical than SMT results. They find that NMT generates output with more accurate word order, but also note that NMT tends to favor more common words, citing examples of mistaking “radiant heat” for “radiation heat” or “slipring” for “ring.”

Translation is by nature open-vocabulary; sentences encountered at test time may well contain words unseen during training. Many methods address the issue of unknown or rare words (or worse, spelling mistakes), and it is a broad topic unto itself. We will just briefly mention here that most of the top NMT results at WMT’17 use Byte-Pair Encoding to segment words into subword units.
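Byte-Pair Encoding, as applied to NMT by Sennrich et al. (2016), starts from characters and repeatedly merges the most frequent adjacent symbol pair, so rare words decompose into subwords the model has seen. Here is a sketch of the merge-learning loop; the toy word-frequency vocabulary, with space-separated symbols and a `</w>` end-of-word marker, follows that paper’s convention:

```python
import re
from collections import Counter

def pair_stats(vocab):
    """Count adjacent symbol pairs across the (word -> frequency) vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace each occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    joined = "".join(pair)
    return {pattern.sub(joined, word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily learn up to num_merges merge operations."""
    merges = []
    for _ in range(num_merges):
        pairs = pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy vocabulary from the BPE literature: frequent subwords like "est</w>"
# emerge after a few merges.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, segmented = learn_bpe(vocab, 10)
```

At test time the learned merges are replayed in order on any new word, so even an unseen word segments into known subword units rather than a single unknown token.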

Parallel translation corpora are particularly costly to produce, and deep neural networks need large training datasets to work well. While more datasets are available for researchers and startups today, we still lack domain-specific corpora for domain adaptation.

Perhaps the best-known problem for Neural Machine Translation is the expensive training cost; modern state-of-the-art models take days to train on multiple GPUs.

https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html

Compared to image recognition, Neural MT does not yet have well-studied transfer learning properties to bootstrap model training.

Closing summary

  • NMT has achieved state of the art MT, but is far from solved.
  • Methods to deal with open vocabulary are an important part of NMT.
  • NMT models are still very expensive to train.

There is a lot going on in Neural Machine Translation, and the field has made rapid strides in recent years. Just as image recognition made leaps to surpass human vision accuracy, hopefully we will continue to make strides towards surpassing human translation accuracy.

Comments welcome!
