Machine Translation Without the Data

Harshvardhan Gupta
Jan 4, 2018 · 8 min read


Deep Learning is being used aggressively in day-to-day tasks. It especially excels in areas with a degree of ‘humanness’ involved, e.g. image recognition. Probably the most useful property of deep networks, unlike other Machine Learning algorithms, is that their performance keeps increasing as they get more data. So if more data can be obtained, a performance increase can be expected.

One of the tasks where deep networks excel is machine translation. They are currently the state of the art in this task, and practical enough that even Google Translate now uses them. In machine translation, sentence-level parallel data is required to train the model, i.e. for every sentence in the source language, the corresponding translated sentence in the target language is needed. It is not hard to imagine why this could be a problem: for some language pairs it is hard to get a large amount of data (so that the power of deep learning can be used).

How This Article is Structured

This article is based on Facebook’s recent paper Unsupervised Machine Translation Using Monolingual Corpora Only.

This article loosely follows the structure of the paper. I have added my own explanations to simplify the material.

This article requires some (very) basic knowledge about neural networks like loss functions, auto-encoders, etc.

The Problem with Machine Translation

As mentioned briefly above, the biggest problem with using neural networks for machine translation is that they require a dataset of sentence pairs in both languages. Such data is available for widely spoken languages like English and French, but not for many other pairs. If the paired data were available, this would be a supervised task.

The Solution

The authors of the paper figured out how to convert this task into an unsupervised one. In this setting, the only data required is two arbitrary corpora, one in each of the two languages, e.g. any fiction novel in English vs. any fiction novel in Spanish. Note that the two novels do not have to be the same.

In the most abstract sense, the authors figured out how to learn a latent space that is common to both languages.

A Recap on Auto-Encoders

Auto-encoders are a broad class of neural networks used for unsupervised tasks. The idea is that they are trained to re-create the same input they are fed. The key is that the network has a layer in the middle, called the bottleneck layer. This layer is supposed to capture all the interesting information about the input and throw away the useless information.

Conceptual Auto-encoder. The middle block is the bottleneck layer that stores the compressed representation.

Simply put, the space in which the input (now transformed by the encoder) lies at the bottleneck layer is known as the latent space.
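As a toy sketch (in numpy, with made-up dimensions), an auto-encoder is just an encoder mapping into a small bottleneck and a decoder mapping back out; the bottleneck activations are the latent representation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy auto-encoder: 6-dim input -> 2-dim bottleneck -> 6-dim reconstruction.
# The weights are random here; a real auto-encoder would train them to
# minimise reconstruction error.
W_enc = rng.normal(size=(6, 2))
W_dec = rng.normal(size=(2, 6))

def encode(x):
    return np.tanh(x @ W_enc)  # activations here live in the latent space

def decode(z):
    return z @ W_dec           # map the latent vector back to input space

x = rng.normal(size=(1, 6))    # one 6-dim input
z = encode(x)                  # its latent (bottleneck) representation
x_hat = decode(z)              # its reconstruction
print(z.shape, x_hat.shape)    # → (1, 2) (1, 6)
```

Only the shapes matter here: whatever goes in, the bottleneck forces it through a 2-dimensional representation.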

De-noising Auto-encoders

If an auto-encoder is trained to reconstruct the input exactly as it was fed in, it may simply learn the identity function. In that case, the outputs will be perfectly reconstructed, but we won’t have any useful features in the bottleneck layer. To remedy this, de-noising auto-encoders are used. First, the actual input is corrupted slightly by adding some noise to it. Then the network is made to reconstruct the original input (not the noisy version). This way, the network is forced to learn useful features of the input by learning what is noise (and what the really useful features are).

Conceptual Example of a De-noising Auto-Encoder. The image on the left is passed through the neural network to produce the reconstruction on the right. Here, the green neurons together form the bottleneck layer.
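The key detail can be shown in a few lines (numpy, with the network replaced by a trivial identity stand-in): the copy fed forward is the noisy one, but the reconstruction target is always the clean original:

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.normal(size=(4, 8))                    # clean inputs
x_noisy = x + 0.1 * rng.normal(size=x.shape)   # corrupted copies fed to the net

# The network (a trivial identity stand-in here) only ever sees x_noisy...
x_hat = x_noisy
# ...but the reconstruction loss always compares against the CLEAN input x,
# so the network cannot win by copying its input straight through.
loss = float(np.mean((x_hat - x) ** 2))
print(loss > 0.0)  # → True
```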

Why Learn a Common Latent Space?

The latent space captures the features of the data (in our case, the data is sentences). So if it was possible to learn a space that would produce the same features when language A was fed to it as when language B is fed to it, it would be possible to have a translation between them. Since the model has learned the right ‘features’, encoding from language A’s encoder, and decoding using language B’s decoder would effectively be asking it to do a translation.

As you may have guessed , the authors utilised denoising auto-encoders to learn a feature space. They also figured out how to make the auto encoder learn a common latent space (they call it an aligned latent space)in order to perform unsupervised machine translation.

Denoising Auto-encoders in Language

The authors used a denoising auto-encoder to learn the features in an unsupervised manner. The loss they define is:

L_auto(θ_enc, θ_dec, l) = 𝔼_{x∼D_l, x̂∼d(e(C(x), l), l)} [Δ(x̂, x)]

Equation 1.0 Denoising Auto-Encoder loss

Explanation of Equation 1.0

l is the language (for this setup, there are 2 possible languages). x is the input, and C(x) is the result of adding noise to x; we will get to the noise function C shortly. e() is the encoder, and d() is the decoder. The final term, Δ(x̂, x), is the sum of cross-entropy errors at the token level. Since we feed in a sequence and get a sequence out, every token needs to be in the right order, so the loss compares the ith token of the output with the ith token of the input, like a separate classification at each position. A token is a single unit that cannot be broken down further; in our case, it is a single word.

So, Equation 1.0 is the loss that makes the network minimize the difference between its output (when given a noisy input) and the original, untouched sentence.
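As a small illustration of the Δ(x̂, x) term (numpy, with a made-up 4-word vocabulary), the ith predicted distribution is scored against the ith target token and the per-token cross-entropies are summed:

```python
import numpy as np

def token_cross_entropy(probs, target_ids):
    """Sum of per-token cross-entropies: the ith predicted distribution
    is scored against the ith target token."""
    return -sum(np.log(probs[i, t]) for i, t in enumerate(target_ids))

# 3 output positions over a 4-word vocabulary; each row is the predicted
# distribution at that position.
probs = np.array([[0.7, 0.1, 0.1,  0.1],
                  [0.1, 0.8, 0.05, 0.05],
                  [0.2, 0.2, 0.5,  0.1]])
target = [0, 1, 2]  # the correct token id at each position
loss = token_cross_entropy(probs, target)
print(round(float(loss), 3))  # → 1.273
```

The loss is low only when every position puts high probability on the right word, which is exactly the "every token in the right order" requirement.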

The 𝔼 and ~ symbol notation.

The 𝔼 is the symbol for expectation. In this context, it means that the inputs are drawn from the distribution of language l, and the average of the loss over that distribution is taken. It is just a mathematical formality; during implementation, the actual loss (a sum of cross entropies) is computed as usual.

The ~, in particular, means “is drawn from the probability distribution of”.

I won’t go into details here, but you can read about this notation in Chapter 8.1 of the Deep Learning Book.

How to Add Noise

Unlike images, where noise can be added simply by adding floating-point numbers to pixel values, adding noise to language has to work differently. The authors therefore developed their own system to create noise. They denote their noise function as C(). It takes in an input sentence, and outputs a noisy version of that sentence.

There are two different ways to add noise.

First, it is possible to simply drop a word from the input with probability P_wd.

Secondly, each word can be shifted from its original position, subject to the constraint:

|σ(i) − i| ≤ k

Equation 2.0

Here, σ(i) denotes the shifted location of the ith token. So, Equation 2.0 means: “a token can shift from its position by at most k tokens to the left or to the right”.

The authors used a k value of 3, and a P_wd value of 0.1.
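A minimal sketch of C() in plain Python (one possible implementation that satisfies the constraint above; the jitter-and-sort trick keeps every displacement within k):

```python
import random

def corrupt(tokens, p_wd=0.1, k=3, rng=None):
    """Noise function C(): drop each word with probability p_wd, then
    locally shuffle so no token moves more than k places (|sigma(i) - i| <= k)."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= p_wd]
    # Sort positions by (index + uniform noise in [0, k]); a token's sort key
    # stays within k of its index, so its final displacement stays within k.
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]

sentence = "the cat sat on the mat".split()
print(corrupt(sentence, rng=random.Random(42)))
```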

Cross Domain Training

In order to learn to translate between two languages, there should be some process that maps an input sentence (in language A) to an output sentence (in language B). The authors call this cross domain training. First, an input sentence x is sampled. Then, the translated output y is produced using the model from the previous iteration, M(), so y = M(x). After that, y is corrupted using the same noise function C() described above, giving C(y). Since C(y) is in language B, the encoder of language B encodes this corrupted version, and the decoder of language A decodes the result to recreate the clean original sentence x. The models are trained using the same sum of cross-entropy errors as in Equation 1.0.
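One such step can be sketched with hypothetical stand-in functions (note the direction: since y = M(x) is in language B, B's encoder reads the corrupted C(y), and A's decoder reconstructs the clean x):

```python
# One cross-domain step for the A -> B direction, with hypothetical stand-in
# functions. M_prev is the (frozen) full model from the previous iteration.
def cross_domain_step(x_A, M_prev, corrupt, encode_B, decode_A, loss_fn):
    y_B = M_prev(x_A)        # y = M(x): translate with the previous model
    y_noisy = corrupt(y_B)   # C(y): apply the same noise function as before
    z = encode_B(y_noisy)    # C(y) is in language B, so B's encoder reads it
    x_hat = decode_A(z)      # A's decoder tries to recover the clean x
    return loss_fn(x_hat, x_A)
```

The symmetric step (B → A) swaps the roles of the two encoders and decoders.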

Learning a Common Latent Space by Adversarial Training

So far, there has been no mention of how the common latent space is learned. The cross domain training mentioned above may help to learn a space that is somewhat similar, but a stronger constraint is needed to push the models toward a truly shared latent space.

The authors used adversarial training. They used another model (called the discriminator) that takes the output of each of the encoders and predicts which language the encoded sentence belongs to. Then, the gradients from the discriminator are used to train the encoders to fool it. This is conceptually no different from a standard GAN (Generative Adversarial Network). The discriminator takes in the feature vector at each time step (because RNNs are used), and predicts which language it came from.
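Conceptually (toy numbers, hypothetical stand-ins), the discriminator and the encoders optimize opposite objectives on the same prediction:

```python
import numpy as np

# p_lang_A is the discriminator's predicted probability that a latent
# vector came from language A.
def disc_loss(p_lang_A, is_lang_A):
    """Discriminator objective: assign high probability to the true language."""
    p = p_lang_A if is_lang_A else 1.0 - p_lang_A
    return -np.log(p)

def adv_loss(p_lang_A, is_lang_A):
    """Encoder objective: fool the discriminator by flipping the label."""
    return disc_loss(p_lang_A, not is_lang_A)

# If the discriminator is confident (p = 0.9) and correct, its own loss is
# small while the encoder's adversarial loss is large, and vice versa; the
# equilibrium is reached when the two latent distributions are indistinguishable.
print(round(float(disc_loss(0.9, True)), 3), round(float(adv_loss(0.9, True)), 3))
# → 0.105 2.303
```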

Putting it all together

The 3 different losses (the auto-encoder loss, the translation loss, and the discriminator loss) are added together, and all the model weights are updated in one step.

Since this is a sequence-to-sequence problem, the authors used an LSTM network with attention, i.e. there are two LSTM-based auto-encoders, one for each language.

At a high level, there are 3 main steps to training this architecture. It follows an iterative training procedure. The training loop would look somewhat like this:

  1. Obtain a translation using the encoder of Language A and the decoder of Language B
  2. Train each auto-encoder to regenerate an uncorrupted sentence when given a corrupted sentence
  3. Improve the translation by corrupting the translation obtained in Step 1 and recreating the original sentence. For this step, the encoder of Language B and the decoder of Language A are trained together (and likewise the encoder of Language A with the decoder of Language B for the other direction)

Note that even though steps 2 and 3 are listed separately, the weights are updated for both of them together.
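The combined update can be sketched as a single sum (the weights here are hypothetical placeholders, not values from the paper):

```python
# Combined objective: the three losses are summed and every module is
# updated in a single optimiser step.
def total_loss(l_auto_A, l_auto_B, l_cd_AB, l_cd_BA, l_adv,
               w_auto=1.0, w_cd=1.0, w_adv=1.0):
    return (w_auto * (l_auto_A + l_auto_B)   # step 2: both auto-encoders
            + w_cd * (l_cd_AB + l_cd_BA)     # step 3: both translation directions
            + w_adv * l_adv)                 # adversarial constraint on the encoders

print(round(total_loss(0.5, 0.4, 1.2, 1.1, 0.3), 2))  # → 3.5
```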

How to Jumpstart this Framework

As mentioned above, the model uses its own translations from the previous iteration to improve its translation capabilities. Before the training loop begins, it is therefore important to have some form of translation capability already. The authors used fastText to learn a word-level bilingual dictionary. Note that this method is very naive and is required only to give the model a starting point.
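The word-by-word seeding can be sketched with a toy dictionary (the entries are made up for illustration):

```python
# Naive word-by-word seed translation. This only bootstraps M() for the
# very first iteration; later iterations use the learned model instead.
seed_dict = {"the": "el", "cat": "gato", "eats": "come"}

def word_by_word(sentence, dictionary):
    # Unknown words are simply copied through unchanged.
    return [dictionary.get(w, w) for w in sentence.split()]

print(word_by_word("the cat eats fish", seed_dict))
# → ['el', 'gato', 'come', 'fish']
```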

The whole framework is given in the flowchart below.

High Level working of the entire Translation Framework. Credit: Robert Aguilera

Conclusion

This was an explanation of a very new technique for performing unsupervised machine translation. It uses multiple losses to improve on individual tasks, while using adversarial training to constrain the behaviour of the architecture.

Call to Action

This was my 10th Medium post. If you have any feedback, or any cool papers you want me to cover, feel free to mention it in the comments.

If you liked this article, make sure to hold that 👏👏👏 icon for as long as you want to!
