A brief history of machine translation paradigms

Teven Le Scao
Published in HuggingFace · 9 min read · May 14, 2020

As a young European, having access to translation at any time just by pulling up my phone is an unprecedented luxury. There’s convenience in knowing I can order a kebab in a hundred languages, cu de toate und scharf, bez příliš mnoho cibule mais avec frites (with everything and spicy, without too much onion but with fries), and there’s beauty in being able to do so even in countries where older generations remember being in conflict with mine. This made machine translation my first love of sorts among machine learning applications: it is why I originally went from applied mathematics to NLP. This week, Hugging Face is proud to release 1000+ translation models from the University of Helsinki, thanks to the hard work of Helsinki’s Jörg Tiedemann and our own Sam Shleifer. To accompany the release, here’s a short history of machine translation efforts over the last century. It’s written with readers familiar with modern NLP in mind, and I tried to draw connections with other fields throughout.

1. Genesis (1933–1945)

The first automated translation systems were independently created in 1933, by George Artsrouni in France¹ and Petr Troyanskii in the USSR². Unfortunately, neither really took hold in engineering or research circles, for different reasons. Artsrouni’s system, a mechanically automated retrieval system that could function as a dictionary, generated a lot of interest in the French administration but could not come to fruition before the start of the Second World War. Troyanskii’s system, which also started as an automated dictionary but grew to incorporate a memory as well as electronic components (computers at the time were still mechanical!), was ignored by the Soviet scientific establishment.

Marian Rejewski’s statue with an Enigma machine in his hometown of Bydgoszcz³

Elsewhere in Europe, events that would prove (only slightly) more impactful were unfolding at the same time. From 1932 to 1933, the Polish Cipher Bureau — most notably Marian Rejewski, after whom the Marian NMT system is named — broke the code of early German Enigma machines. During the Second World War itself, cryptography became a key topic and mobilized significant intellectual and financial resources. After the war ended, with the Cold War rising, machine translation became a topic of interest to both superpowers’ intelligence communities. A key problem, for example, was automatically translating scientific articles from the other side, as scientific output outpaced the number of competent translators.

In this context, the 1949 Weaver memorandum on translation⁴ was a landmark in the US, advocating that automated translation was becoming possible thanks to the newly created computer. It proposed several approaches, like storing the rules of language in the machine or learning statistical similarities between sentences, with even a mention of early work on artificial neurons. Quite striking is the direct lineage it draws between machine translation and wartime cryptography: it opens with a war anecdote and frames the task of translating Russian as one of decoding.

When I look at an article in Russian, I say ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’

Nothing like a good bout of great-power rivalry for research funding!

2. Rule-based MT (1949–1984)

Early rule-based MT (1949–1967)

After the publication of the Weaver memorandum, research in machine translation began in earnest in the United States, mostly focused on translating Russian scientific articles into English. Translation systems of the time can generally be placed on a scale between empirical and linguistically-grounded approaches. On the empirical end of the scale, for example, research at the RAND Corporation proceeded in cycles of translation and editing: start from a few basic rules, observe the result on a predetermined corpus of Russian texts, then revise the glossary and grammatical rules of the system, and repeat the cycle, in a sort of expectation-maximization algorithm performed by humans (perhaps an ancestor of graduate student descent?). On the other hand, academic research, especially at MIT, focused on finding intermediate representations between the source and target sentences. Sufficiently expressive representations, it was hoped, could allow for general-purpose translation. Another goal was building an interlingua, i.e. a representation of semantic meaning independent of language; Noam Chomsky was introducing universal grammar at the same time⁵.

A hybrid system to translate Russian technical documents was demonstrated in 1954 by Georgetown University and IBM. Deemed very impressive at the time, it spurred investment in the United States and seeded interest elsewhere, mostly in the Soviet Union and in Europe, where research concentrated on the theoretical approach. Systems at the time relied on the work of extensive teams of linguists: they encoded human-written instructions rather than learning word correspondences on their own.

Knowledge-based MT (1967–1984)

The 1966 ALPAC report⁶ is generally held to have ended that first phase of machine translation hype, after it made the case that American research funding should be directed to machine-aided human translation rather than fully automated machine translation. After its publication, research funding dried up in the United States, leaving machine translation research to Canada, Europe, and the Soviet Union.

We have already noted that, while we have machine-aided translation of general scientific text, we do not have useful machine translation. Further, there is no immediate or predictable prospect of useful machine translation.

Classic reviewer #2.

The Vauquois Pyramid

One influential concept for understanding the evolution of rule-based MT during that time is the Vauquois pyramid, reproduced above. First, the system attempts to understand the source text (analysis) and to represent this understanding. Then, it produces text in the target language (generation) from this representation. This was christened knowledge-based MT: the goal was to reach ever more complete and general representations, moving up the pyramid, as opposed to earlier rule-based systems’ direct or only syntactically-informed translation. However, transfer machine translation, operating at a lower level of the pyramid, remained more effective and powered the deployed systems of the time, mostly in domain-limited technical use cases like Canada’s Météo system.

3. Data-driven MT (1984-present)

Example-based MT (1984–1993)

Looks like you’ve missed your daily French lesson today!

By the 1980s, computers had gotten a lot more powerful, especially in storage capacity. This allowed for larger databases of text, which had yet to be systematically exploited. One early idea to do so was example-based MT, first proposed in 1984 in Japan⁷. Example-based systems built on the observation that beginner-level foreign-language speakers rely on sentences they already know to produce new ones: an example between French and English is shown in the figure⁸. Similarly, these systems relied on databases of known examples to produce new translations, querying the closest one, as in the toy sketch below. Although those ideas would eventually be subsumed into the broader framework of statistical MT, they were the first example of data-driven translation.
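
To illustrate the retrieval idea, here is a deliberately naive sketch in Python. It is my own toy illustration: the sentences, the translations, and the difflib-based similarity are all invented for the example, and real example-based systems went further by recombining fragments from several matches.

```python
import difflib

# Toy "translation memory": source sentences we already know how to translate.
# Both the sentences and their translations are invented for this illustration.
memory = {
    "the cat sleeps on the sofa": "le chat dort sur le canapé",
    "the dog eats in the kitchen": "le chien mange dans la cuisine",
    "I would like a coffee, please": "je voudrais un café, s'il vous plaît",
}

def translate_by_example(sentence: str) -> str:
    """Return the stored translation of the closest known source sentence."""
    closest = difflib.get_close_matches(sentence, memory.keys(), n=1, cutoff=0.0)
    return memory[closest[0]]

print(translate_by_example("the cat eats in the kitchen"))
# -> "le chien mange dans la cuisine": the closest stored example wins.
```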

Statistical MT (1993–2013)

Underneath all this, a revolution was brewing, as a few outside developments came to fruition in the 90s. Statistical speech recognition started showing strong results on the back of advances in automata theory and hidden Markov models; computers became still more powerful and accessible; and high-quality, abundant datasets appeared, such as the Hansard transcripts of the Canadian parliament. In 1988, IBM researchers had published the outline of modern statistical translation⁹, which proved controversial to say the least. As a famous anonymous review of the time states:

The validity of a statistical (information theoretic) approach to MT has indeed been recognized, as the authors mention, by Weaver as early as 1949. And was universally recognized as mistaken by 1950 (cf. Hutchins, MT — Past, Present, Future, Ellis Horwood, 1986, p. 30ff and references therein). The crude force of computers is not science. The paper is simply beyond the scope of COLING.

Reviewer #2 strikes again!

Nevertheless, the statistical approach quickly proved fruitful, as IBM’s models 1–5 became references in machine translation. Those were powered by the expectation-maximization algorithm, used to learn both alignments between languages — which and how many words in the source and target correspond to each other — and a dictionary to translate with once alignments are computed. In a sense, they were direct descendants of the early RAND empirical approach: instead of being fed instructions by teams of linguists, the computer could learn all of the relationships from data on its own. By the 2010s, statistical methods had asserted their hegemony, powering virtually all of the internet-based translation services that make up the bulk of translation use.
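
To make the alignment learning concrete, here is a minimal sketch of IBM Model 1 in Python, my own toy illustration on an invented three-sentence corpus rather than IBM’s actual code: EM alternates between distributing each foreign word’s probability mass over the English words it could align to, and re-normalizing the translation table.

```python
from collections import defaultdict

def train_ibm_model1(sentence_pairs, iterations=20):
    """Learn word translation probabilities t(f | e) with EM (IBM Model 1)."""
    # Uniform initialization over the foreign vocabulary.
    foreign_vocab = {f for f_sent, _ in sentence_pairs for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(foreign_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected (f, e) co-translation counts
        total = defaultdict(float)  # expected counts for each English word e
        for f_sent, e_sent in sentence_pairs:
            for f in f_sent:
                # E-step: spread f's probability mass over the English words.
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-normalize the translation table.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Invented three-sentence French-English corpus.
pairs = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("une maison bleue".split(), "a blue house".split()),
]
t = train_ibm_model1(pairs)
print(f"t(maison | house) = {t[('maison', 'house')]:.2f}")  # by far the largest entry for "house"
```

Models 2 to 5 build richer structure on top of this, modeling word order and how many target words each source word produces, but the EM backbone stays the same.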

Neural MT (2013-present)

In his 1949 memorandum, Warren Weaver briefly touches upon early research on artificial neurons as a promising avenue for machine translation. Sixty years later, neural networks had made significant progress on other tasks, but had yet to be convincingly applied to translation. Practical neural language models, powered by recurrent neural networks, appeared around 2011¹⁰. Translation could then be reformulated as a conditional language modeling task: instead of predicting the most likely next word, predicting the most likely next word conditioned on the source text. The first modern neural machine translation paper followed in 2013. It consisted of an encoder model that produced a representation of the input with a convolutional neural network and of a decoder model that generated text from that representation with a vanilla recurrent neural network (RNN)¹¹.
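
Spelled out (in my notation, not the papers’), with x the source sentence and y = (y_1, …, y_T) the target sentence, the translation probability factorizes word by word, exactly like a language model with an extra conditioning variable:

```latex
% A plain language model factorizes p(y) = \prod_t p(y_t \mid y_{<t});
% conditioning every step on the source x turns it into a translation model.
p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)
```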

At the time, neural MT was still underperforming compared to statistical MT, and it took two main developments from 2014 for it to eventually come out on top. First, vanilla RNNs were replaced with long short-term memory RNNs (LSTMs)¹². Then, learnable attention mechanisms were re-purposed from their computer vision roots and added to LSTMs¹³. By 2016, Google Translate had switched to neural MT. Transformer-based models¹⁴, which do away with the recurrent part entirely and only use iterated attention modules, have become the norm in recent years as they scale better than LSTMs with compute and available data. The Helsinki models we’re releasing today all rely on this architecture, and can be used as in the sketch below. The power of transformers was quickly noticed outside of machine translation and, combined with pre-training, they now form the backbone of most modern NLP applications.
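
To make that concrete, here is a minimal sketch of running one of the released checkpoints through the transformers library. It assumes a recent transformers version with sentencepiece installed; Helsinki-NLP/opus-mt-en-fr is the English-to-French pair, and I’m assuming the other pairs follow the same opus-mt-{src}-{tgt} naming scheme.

```python
from transformers import MarianMTModel, MarianTokenizer

# English-to-French Marian model from the Helsinki-NLP release.
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src = ["I would like a kebab with everything, but without too much onion."]
batch = tokenizer(src, return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```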

Conclusion

MT performance can be divided into three goals: human-level translation, general-purpose translation, and translation without human input. Older knowledge-based systems could produce human-level translation without human input, but only on very narrowly defined data. Human-in-the-loop systems sped up human-level translation but required manual intervention. Finally, statistical translation could handle any text without a human in the loop, but not always at the level of human translators. If you’re lucky enough to be translating between data-rich, similar languages, neural MT offers the best of all worlds. However, if the language pair you’re interested in is not data-rich, there is still quite a bit of work to do before we get there. An interesting project for realizing the size of the task at hand is DARPA’s LORELEI program, which simulates a crisis in a region of the world whose language is underserved and asks researchers to build a translation system in two weeks. Even for languages spoken by tens of millions of people, throwing teams of highly trained linguists at the problem is sometimes still the way to go!

References

[1] “La machine à traduire française aura bientôt trente ans”, Automatisme 5(3): 87–91, M. Corbé, 1960

[2] Machine translation: past, present, future. J. Hutchins, 1986

[3] Marian Rejewski statue photo from Peter Reed

[4] Reproduced in: Locke, W.N.; Booth, A.D., eds. (1955). “Translation”. Machine Translation of Languages. Cambridge, Massachusetts: MIT Press. pp. 15–23. ISBN 0-8371-8434-7.

[5] Aspects of the Theory of Syntax, Noam Chomsky, 1965

[6] Language and Machines: Computers in Translation and Linguistics, ALPAC, 1966

[7] A framework for a mechanical translation between Japanese and English by analogy principle, Nagao 1984

[8] EBMT figure from Purest ever example-based machine translation: detailed presentation and assessment, Y. Lepage, E. Denoual, Machine Translation, Springer Verlag, 2007, pp.251–282. hal-00260994

[9] A Statistical Approach to Language Translation, P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, P. Roossin, COLING, 1988.

[10] RNNLM — Recurrent Neural Network Language Modeling Toolkit, T. Mikolov, S. Kombrink, A. Deoras, L. Burget, J. Černocký, 2011

[11] Recurrent Continuous Translation Models, N. Kalchbrenner, P. Blunsom, 2013

[12] Sequence to Sequence Learning with Neural Networks, I. Sutskever, O. Vinyals, Q. Le, 2014

[13] Neural Machine Translation by Jointly Learning to Align and Translate, D. Bahdanau, K. Cho, Y. Bengio, 2014

[14] Attention Is All You Need, A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, 2017
