Unsupervised Machine Translation Using Monolingual Corpora (Paper summary)

OutisCJH · Published in Analytics Vidhya · Jul 18, 2020
  • Disclaimer: This post serves as my learning journal.

Link to the original paper: https://arxiv.org/pdf/1711.00043.pdf


Neural Machine Translation (NMT) is very important in today's world. It allows people who speak different languages to communicate effectively with each other. A good NMT model can efficiently and accurately translate a sentence from one language to another. However, NMT models are hard to train and typically need large amounts of parallel data. This post introduces the unsupervised machine translation model developed by Facebook.

Why unsupervised machine translation?

  • Parallel corpora are costly to build: they require a lot of manpower and specialised expertise.
  • Parallel corpora are simply not available for many low-resource languages.

Goal of this paper

To train a general machine translation system without supervision, using only a monolingual corpus for each language.

Key idea

  • Build a common latent space between the two languages/domains (e.g. English and French) and learn to translate by reconstructing in both domains.
  • The model has to be able to work with noisy translations (from source to target language and vice versa).
  • The source and target sentence latent representations are constrained to have the same distribution using an adversarial regularisation term: the encoder tries to fool a discriminator that is simultaneously trained to identify the language of a given latent representation. This is pretty similar to the working mechanism of a GAN.

Architecture

The model consists of an encoder and a decoder.

Encoder -> encodes source and target sentences into a latent space (only one encoder for both languages).

Decoder -> decodes from the latent space back into source or target sentences (only one decoder for both languages). The decoder itself is language-independent; it is told which language to generate.

Let's declare:

The encoder takes in a sequence of words W and generates a latent representation Z. The decoder takes in Z and a language l and generates words in language l.
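
To make the later equations easier to parse, here is a small notation cheat-sheet (my own shorthand; e, d, C and M are the encoder, decoder, corruption function and current translation model that appear in the loss functions below):

```latex
\begin{aligned}
z &= e(x, \ell) && \text{encode sentence } x \text{ of language } \ell \text{ into the latent space} \\
\hat{x} &= d(z, \ell) && \text{decode a latent representation } z \text{ into a sentence of language } \ell \\
\tilde{x} &= C(x) && \text{noise model: randomly drop and swap words in } x \\
M(x) &= d(e(x, \ell_1), \ell_2) && \text{translation of } x \text{ by the current model (greedy decoding)}
\end{aligned}
```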

Model Design:

A sequence-to-sequence model with attention, without input feeding.

Input feeding is an approach to feed attentional vectors “as inputs to the next time steps to inform the model about past alignment decisions” (Source)

The encoder is a bidirectional LSTM which returns a sequence of hidden states. The decoder, also an LSTM, takes in its previous hidden state, the current word, and a context vector given by a weighted sum over the encoder states. Both the encoder and the decoder have 3 layers. As mentioned earlier, both source and target language share the same encoder, and the same goes for the decoder; the attention weights are shared across the two languages as well. The embedding and LSTM hidden-state dimensions are set to 300. Sentences are generated using greedy decoding.
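
To make this concrete, here is a minimal PyTorch-style sketch of a shared encoder/decoder of this shape. It is my own simplification, not the authors' code: the class names and the way the target language is passed in (as a learned language embedding) are assumptions, and the attention is wired more crudely than in the paper (the context vector is simply added to the decoder state).

```python
# Minimal sketch of the shared encoder/decoder (3 layers, dimension 300).
# Class names and the language-embedding trick are illustrative assumptions.
import torch
import torch.nn as nn

DIM, LAYERS = 300, 3

class SharedEncoder(nn.Module):
    """One bidirectional LSTM encoder used for BOTH languages."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, DIM)
        self.lstm = nn.LSTM(DIM, DIM // 2, num_layers=LAYERS,
                            bidirectional=True, batch_first=True)

    def forward(self, tokens):                       # tokens: (batch, src_len)
        states, _ = self.lstm(self.embed(tokens))    # (batch, src_len, DIM)
        return states                                # the latent sequence Z

class SharedDecoder(nn.Module):
    """One LSTM decoder for both languages; the target language is indicated
    by a language embedding concatenated to every input token embedding."""
    def __init__(self, vocab_size, n_langs=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, DIM)
        self.lang_embed = nn.Embedding(n_langs, DIM)
        self.lstm = nn.LSTM(2 * DIM, DIM, num_layers=LAYERS, batch_first=True)
        self.out = nn.Linear(DIM, vocab_size)

    def forward(self, prev_tokens, lang_id, enc_states):
        # prev_tokens: (batch, tgt_len), lang_id: (batch,), enc_states: (batch, src_len, DIM)
        emb = self.embed(prev_tokens)
        lang = self.lang_embed(lang_id).unsqueeze(1).expand_as(emb)
        hidden, _ = self.lstm(torch.cat([emb, lang], dim=-1))
        # Dot-product attention: context = weighted sum over the encoder states.
        scores = torch.bmm(hidden, enc_states.transpose(1, 2))   # (batch, tgt_len, src_len)
        context = torch.bmm(scores.softmax(dim=-1), enc_states)  # (batch, tgt_len, DIM)
        return self.out(hidden + context)                        # logits over the vocabulary
```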

Overview of method:

The training consists of several parts:

  1. Denoising part:

Train the encoder and decoder by reconstructing a sentence in a particular domain from a noisy version of that same sentence. This is done for both domains (the noisy input and the reconstruction are in the same language). The input sentence is corrupted by randomly dropping or swapping some of its words; a sketch of this corruption is given right after the loss below.

The encoder and decoder are trained by minimising an objective function that measures their ability to reconstruct sentences from such noisy inputs. Basically, this part is trained in the same way as a typical denoising auto-encoder.

Loss function for this part:

Screenshot taken from the original paper.
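
The corruption C(x) mentioned above can be implemented very simply. The sketch below drops each word with a small probability and slightly shuffles the word order so that no word moves too far; the exact drop probability and shuffle window used in the paper may differ from these placeholder values.

```python
# Sketch of the noise model C(x): randomly drop words, then locally shuffle
# the remainder so that each word moves at most ~k positions.
# p_drop and k are placeholder values, not necessarily the paper's settings.
import random

def corrupt(sentence, p_drop=0.1, k=3):
    words = sentence.split()
    # 1. Randomly drop words (keep the sentence if everything got dropped).
    kept = [w for w in words if random.random() > p_drop] or words
    # 2. Local shuffle: sort by (original index + uniform noise in [0, k+1)).
    keys = [i + random.uniform(0, k + 1) for i in range(len(kept))]
    return " ".join(w for _, w in sorted(zip(keys, kept)))

print(corrupt("the cat sat on the mat"))
```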

2. Cross domain training part:

Screenshot taken from the original paper.
  • M is the full translation model, consisting of the encoder and the decoder. From equation 2, we can see that the loss is the sum of token-level cross-entropy losses between x and x_hat. x is sampled from the original dataset; x_hat is produced by corrupting the translation M(x) and running it back through the encoder and decoder (x_hat ~ d(e(C(M(x)), l2), l1)). Take note of the difference between l1 and l2 in equation 2.

A simple example that explains equation 2:

Sample a proper English sentence X and translate it into Spanish with the current model M. Take the generated Spanish, corrupt it, encode it as a Spanish sentence (l2), and decode back into English (l1) to produce X_hat. The loss function then tries to reduce the difference between X and X_hat.
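
In code, one cross-domain training step could look roughly like the sketch below. It reuses the pieces sketched earlier; `model`, `translate` and `corrupt` are illustrative stand-ins, not real library calls, and the tensor shapes assume teacher forcing with a shifted target.

```python
# Sketch of one cross-domain step (equation 2): x -> M(x) -> corrupt -> reconstruct x.
# All names are illustrative; `translate` runs the current model M with greedy decoding.
import torch
import torch.nn.functional as F

def cross_domain_step(model, translate, corrupt, x_tokens, l1_ids, l2):
    with torch.no_grad():                      # no gradient through the translation y = M(x)
        y_tokens = translate(x_tokens, src_lang=l1_ids, tgt_lang=l2)
    y_noisy = corrupt(y_tokens)                # C(M(x)) -- here corrupt works on token tensors
    z = model.encoder(y_noisy)                 # e(C(M(x)), l2)
    logits = model.decoder(x_tokens[:, :-1], l1_ids, z)   # d(..., l1), teacher forcing
    # Token-level cross-entropy between the reconstruction and the original x.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           x_tokens[:, 1:].reshape(-1))
```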

3. Adversarial training

The decoder of a neural machine translation system works well only when its input is produced by the encoder it was trained with, or at the very least, when that input comes from a distribution very close to the one induced by its encoder.

To make sure that the decoder's input comes from a distribution very close to the one induced by its encoder, we add a discriminator (like the one in a GAN). The discriminator is trained to distinguish whether an encoding produced by the encoder came from a source sentence or a target sentence, while the encoder is trained to produce encodings that fool it. Ultimately, the goal is to train the encoder to the point where the discriminator can no longer tell source and target encodings apart. When that happens, we can be confident that whatever the decoder receives comes from a distribution very close to the one induced by the encoder it was trained with.

Without this adversarial training we could not be confident about that distribution, because during the first two parts of training the encoder sees sentences from either language (source or target). Since we want the model to work well for source -> target translation, the adversarial term is added to align the two latent distributions. A sketch of such a discriminator is given after the loss below.

Loss function for this part:

Screenshot taken from the original paper.
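
For concreteness, here is a sketch of what the discriminator and the two opposing losses could look like. The MLP size and the mean-pooling over time are my own assumptions (the paper classifies the latent states directly); the important part is the pair of cross-entropy losses pulling in opposite directions.

```python
# Sketch of the adversarial part: a discriminator guesses which language an
# encoding came from, while the encoder is trained to make it guess wrong.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageDiscriminator(nn.Module):
    def __init__(self, dim=300, n_langs=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 1024), nn.LeakyReLU(),
            nn.Linear(1024, 1024), nn.LeakyReLU(),
            nn.Linear(1024, n_langs),
        )

    def forward(self, latent):                 # latent: (batch, len, dim)
        return self.net(latent.mean(dim=1))    # pool over time, predict the language

def adversarial_losses(disc, latent, true_lang, other_lang):
    # Discriminator update: classify the (detached) encoding correctly.
    d_loss = F.cross_entropy(disc(latent.detach()), true_lang)
    # Encoder update: produce encodings the discriminator labels as the OTHER language.
    e_loss = F.cross_entropy(disc(latent), other_lang)
    return d_loss, e_loss
```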

Final loss function:

Screenshot taken from the original paper.
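
From my reading, the final objective is simply a weighted sum of the three parts above, applied in both translation directions, with the lambda weights as hyper-parameters (the exact form and values are in the paper's equation):

```latex
\mathcal{L}_{\text{final}} \approx
  \lambda_{\text{auto}} \big[ \mathcal{L}_{\text{auto}}(\text{src}) + \mathcal{L}_{\text{auto}}(\text{tgt}) \big]
+ \lambda_{\text{cd}}   \big[ \mathcal{L}_{\text{cd}}(\text{src} \to \text{tgt}) + \mathcal{L}_{\text{cd}}(\text{tgt} \to \text{src}) \big]
+ \lambda_{\text{adv}}  \, \mathcal{L}_{\text{adv}}
```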

Overall model:

Screenshot taken from the original paper.

Result:

The results published in the paper are very promising. With just 3 iterations, the model manages to generate very good translations.

Screenshot taken from the original paper.

Personal opinion:

This paper is very easy to read, even for an NMT novice like me. It doesn't require any background knowledge in NMT. I strongly suggest that anyone interested in this topic read the paper for a more detailed explanation and the implementation details. This kind of unsupervised NMT training is very important, and I believe it will be even more so in the future, as labelled data is always hard to find while unlabelled data is much easier to acquire.

If you like my paper summary, please follow me and give me a clap. This is my first time writing a paper summary, so feel free to comment and let me know if any information is wrong. Thank you.
