5 Minute Paper Summary: Phrase-Based & Neural Unsupervised Machine Translation, by Facebook AI Research

Thomas Packer, Ph.D. · TP on CAI · Dec 31, 2019

Guillaume Lample et al. of Facebook AI Research published an interesting paper in 2018: “Phrase-Based & Neural Unsupervised Machine Translation”. And by interesting I mean awesome. The paper is 14 pages long. I provide a link to the paper and some related material at the bottom of this story, including a video of Guillaume presenting the paper and a GitHub repo.

This paper is interesting because the authors tackle the very useful and quite difficult task of machine translation, and they do it well using self-supervised machine learning. Self-supervision is a form of unsupervised learning in which a learning algorithm that normally requires human supervision is given automatically generated (and less reliable) supervision by another algorithm that is part of the same learning system. To understand how cool this is, ask yourself: how would you program a computer to translate from one language into another? Then ask yourself: could you apply supervised machine learning to make that work more scalable, despite the cost of human-provided examples? And finally, ask yourself: how would you do all that without explicitly supervising the machine learning with expensive parallel corpora? Unsupervised ML is even more scalable than supervised ML, but it can be very tricky to get working in practice. Yet that is exactly what they have done.

Contributions

The paper illustrates two ways to apply and combine three principles for building a machine translation system with unsupervised machine learning, yielding systems that can, in some cases, beat state-of-the-art supervised MT systems:

  1. Initialization: initialize a translation model with crude approximate mappings, such as a word-by-word translation using a bilingual dictionary.
  2. Language modeling: use the large volume of readily available monolingual text to train a language model for each language individually (without supervision). Language models recognize whether (target-language) text is written grammatically, according to the rules of that language.
  3. Iterative back-translation: the full power of self-supervision comes when you train two mutually dependent translation systems together. Automatically translating from language A into language B, even if the generated B side is noisy, yields training pairs with a highly reliable target (the original A sentences) on which to train a model that translates from language B back into language A. (A toy sketch of this loop follows the list.)
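To make the interplay of these three principles concrete, here is a toy Python sketch. The function names and the trivial stand-in models are my own illustration, not the authors' code; the point is only the shape of the loop, in which each translation direction manufactures training data for the other.

```python
# Toy sketch of the three principles. The helper names and trivial "models"
# are illustrative stand-ins, not the paper's implementation.
from collections import Counter

def init_word_by_word(bilingual_dict):
    """Principle 1: crude initialization via word-by-word dictionary lookup."""
    return lambda sentence: [bilingual_dict.get(w, w) for w in sentence]

def train_unigram_lm(monolingual_corpus):
    """Principle 2: a (deliberately trivial) language model trained on
    monolingual text only; it scores how 'natural' a sentence looks."""
    counts = Counter(w for sent in monolingual_corpus for w in sent)
    total = sum(counts.values()) or 1
    return lambda sentence: sum(counts[w] / total for w in sentence)

def back_translation_loop(src_corpus, tgt_corpus, src2tgt, tgt2src,
                          train_model, n_iters=3):
    """Principle 3: each direction is trained on synthetic pairs produced
    by the other direction, so the two improve together."""
    for _ in range(n_iters):
        # Noisy target sentences paired with their clean source originals.
        pairs_for_tgt2src = [(src2tgt(s), s) for s in src_corpus]
        tgt2src = train_model(pairs_for_tgt2src)
        # Noisy source sentences paired with their clean target originals.
        pairs_for_src2tgt = [(tgt2src(t), t) for t in tgt_corpus]
        src2tgt = train_model(pairs_for_src2tgt)
    return src2tgt, tgt2src
```

In the paper, the role of `train_model` is played by training an NMT model or re-estimating PBSMT phrase tables on the synthetic parallel data.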

Methods

They follow the above principles when building both a neural machine translation (NMT) system and a Phrase-Based Statistical Machine Translation (PBSMT) system.

For NMT:

  1. Initialization: Instead of relying on a bilingual whole-word dictionary, they initialize their model using a method that relies on sub-word units and byte-pair encoding (BPE). They also jointly learn token embeddings, which places the source and target vocabularies in the same embedding space.
  2. Language modeling: For each of the source and target languages, they combine a noise model that drops and swaps words with a denoising autoencoder that learns to reconstruct the original sentence (sketched after this list).
  3. Iterative back-translation: Given translation models mapping sentences from source to target and from target to source, they generate pairs of synthetic parallel sentences and train each translation model on the output of the other. They also share encoder and decoder parameters across the two languages in both translation models.
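The noise model is easy to picture. Below is a rough Python sketch of the kind of corruption used for denoising autoencoding; the drop probability and shuffle bound are hyperparameters, and the values here are only illustrative. The autoencoder is then trained to reconstruct the clean sentence from its corrupted version.

```python
import random

def add_noise(sentence, drop_prob=0.1, max_shift=3):
    """Corrupt a tokenized sentence for denoising autoencoding:
    drop each word with probability `drop_prob`, then locally shuffle
    the survivors so no word drifts more than ~`max_shift` positions."""
    kept = [w for w in sentence if random.random() > drop_prob] or sentence[:1]
    # Local shuffle: sort by original index plus a bounded random offset.
    keys = [i + random.uniform(0, max_shift) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept), key=lambda kw: kw[0])]

print(add_noise("the cat sat on the mat".split()))
# e.g. ['the', 'sat', 'cat', 'on', 'mat']
```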

In general, PBSMT models rely on phrase tables to map short phrases from one language into another, and on a language model to find the most probable target sentence (translation) given a source sentence. For PBSMT:

  1. Initialization: They populate the initial phrase tables (from source to target and from target to source) using an inferred bilingual dictionary built from monolingual corpora (without using any parallel corpora), by aligning monolingual word embedding spaces in an unsupervised way (a rough sketch of this scoring follows the list).
  2. Language modeling: They pre-learn (and keep fixed) smoothed n-gram language models for both the source and target languages.
  3. Iterative back-translation: They use the above phrase tables and language models as an initial translation model. Using this translation model, they translate source text into target text, thereby generating training data on which a typical supervised PBSMT pipeline trains a target-to-source translation model. The training process updates and grows the phrase tables. Using that model, they can generate training data for the reverse translation model. They iterate this process until stopping criteria are met.
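To give a feel for the PBSMT initialization, here is a rough NumPy sketch of how a first phrase table can be scored once the source and target word embeddings have been aligned into a shared space: each source word receives a softmax distribution over target words based on cosine similarity. The dictionary-based interface and the temperature value are my own simplifications, not the authors' exact formulation.

```python
import numpy as np

def init_phrase_table(src_vecs, tgt_vecs, temperature=10.0):
    """Score candidate translations for each source word from aligned
    cross-lingual embeddings: softmax over cosine similarities.
    `src_vecs` / `tgt_vecs`: dicts mapping words to same-dimension vectors
    that already live in a shared (unsupervisedly aligned) space."""
    src_words, tgt_words = list(src_vecs), list(tgt_vecs)
    S = np.array([src_vecs[w] for w in src_words], dtype=float)
    T = np.array([tgt_vecs[w] for w in tgt_words], dtype=float)
    S /= np.linalg.norm(S, axis=1, keepdims=True) + 1e-9
    T /= np.linalg.norm(T, axis=1, keepdims=True) + 1e-9
    logits = temperature * (S @ T.T)              # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # p(target word | source word)
    return {s: dict(zip(tgt_words, probs[i])) for i, s in enumerate(src_words)}
```

These probabilities seed the source-to-target and target-to-source phrase tables; iterative back-translation then refines and grows them, as described in step 3 above.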

Personal note: One of my long-time fascinations is the general principle of self-supervised machine learning, or what I call “bootstrapping”. It appears from time to time in different guises. It is related to Blum and Mitchell’s semi-supervised “co-training”. In the best cases, it is fully unsupervised. All forms seem to rely on a fragile cyclical dependency between two mutually supporting processes plus some kind of smart initialization. The language learning of Tom Mitchell’s NELL (which must have been inspired by the psycho-linguistic work of Pinker) is another important example. NELL co-trains a knowledge model and a language understanding model in a mutually dependent way. I will write a story dedicated to bootstrapping later, as it is a general pattern that I believe will make a challenge as big as conversational AI much more scalable.

Resources

Paper: Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, Marc’Aurelio Ranzato. “Phrase-Based & Neural Unsupervised Machine Translation.” EMNLP 2018 (arXiv:1804.07755).

Code: https://github.com/facebookresearch/UnsupervisedMT
