Recurrent Neural Networks for Language Translation

Tools that allow any person to communicate with any other person truly make the world a better place. The Rosetta Stone was the first such tool; dictionaries came next, and eventually sophisticated systems such as Google Translate, a language translator you have probably used before. Deep Learning will soon change how these systems work, and the models that enable this change have all kinds of applications in NLP beyond machine translation (such as building an opinion generator, which part of the AI-Society team will actually hack on, so stay tuned).

The Rosetta Stone is considered the first language translator

Let’s first talk about recurrent neural network (RNN) based language models. Yoshua Bengio proposed using artificial-neural-network-based statistical modeling to compute the probability of a sequence of words occurring. This approach proved successful; however, feedforward neural networks can’t receive variable-length sequences as input, which limits the power of the model. Since RNNs allow variable-length sequences both as input and as output, they are naturally the next step in statistical language modeling. [1] The RNN architecture is presented in the diagram below.

Simple Recurrent Neural Network architecture model presented by Mikolov et al.

In this model we are given a sequence of word vectors as input. The network unrolls over t time-steps, each with a hidden layer whose neurons perform a linear matrix operation on their inputs followed by a non-linear operation. At each time-step, the hidden layer receives the output of the previous step together with the next word vector in the text corpus, and uses them to predict the next word in the sequence (conditioning the network on all previous words).

Equation for computing the hidden state with a linear neural network at each time-step
Equation for the softmax classifier
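Since the original equation images are standard, they can be written out explicitly (the weight-matrix names below follow a common convention and are not the post's original symbols):

```latex
h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)
\qquad
\hat{y}_t = \mathrm{softmax}\!\left(W^{(S)} h_t\right)
```

Here $x_t$ is the word vector at time-step $t$, $h_t$ is the hidden state, and $\hat{y}_t$ is the predicted probability distribution over the vocabulary for the next word.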

These are the basics of RNNs. However, simple RNN architectures have problems, which were explored by Bengio et al. In practice, simple RNNs aren’t able to learn “long-term dependencies”. Let’s analyze the following example in which we try to predict the last word in a sentence:

“I prefer writing my code in Node JS because I am fluent in ______.”

The blank could plausibly be filled with a programming language, and if you know about backend development you might know that the answer is JavaScript. For a program to make this prediction, it needs context about Node JS and JavaScript from somewhere else in the text. Two fancier types of RNNs solve this problem: Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs). The TensorFlow documentation has an amazing tutorial on language modeling with LSTMs, so for the purpose of this blog post we will do the same thing with GRUs.

Gated Recurrent Units

GRUs were introduced by Cho et al. The main idea behind this architecture is to keep around memories that capture long-distance dependencies, and to allow error signals to flow at different strengths depending on the inputs.

Instead of directly computing the hidden layer at the next time-step, a GRU first computes an update gate (another layer), taking the current word vector and the previous hidden state as inputs.

Update gate
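Written out (reconstructing the standard GRU equation the image showed; $\sigma$ is the sigmoid function):

```latex
z_t = \sigma\!\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right)
```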

Then a reset gate is computed with the same equation but with different weights.

Reset gate
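As stated above, the reset gate uses the same form with its own weights:

```latex
r_t = \sigma\!\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right)
```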

A new memory is then computed using the reset gate. If the reset gate is close to 0, the new memory ignores the previous hidden state and stores only the information from the current word (a reset).

Memory (reset)
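In equation form (with $\circ$ denoting element-wise multiplication):

```latex
\tilde{h}_t = \tanh\!\left(W x_t + r_t \circ U h_{t-1}\right)
```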

Finally, the update gate combines the previous hidden state and the new memory to compute the final memory at the current time-step.

Final memory
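That combination is a convex mixture controlled by the update gate (following the sign convention in Richard Socher's lecture notes, which the illustration below comes from):

```latex
h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t
```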
Clean illustration of the architecture by Richard Socher

Now that you understand the architecture of a GRU cell, you can do some really interesting things. For instance, you could train this model and compare the perplexity an LSTM yields with the perplexity a GRU yields. You could also modify this piece of code and build your own language translator using GRUs instead of LSTMs.
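To make the four equations above concrete, here is a minimal sketch of a single GRU forward step in numpy. The parameter names (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`) and the toy dimensions are assumptions for illustration, not part of any real library; it follows the convex-mixture convention from Socher's notes used in this post.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU forward step, mirroring the four equations in the post."""
    Wz, Uz, Wr, Ur, W, U = (params[k] for k in ("Wz", "Uz", "Wr", "Ur", "W", "U"))
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))   # new memory (reset applied)
    return z * h_prev + (1.0 - z) * h_tilde         # final memory

# Toy dimensions: 4-dim word vectors, 3-dim hidden state (illustrative only).
rng = np.random.default_rng(0)
shapes = {"Wz": (3, 4), "Uz": (3, 3), "Wr": (3, 4), "Ur": (3, 3),
          "W": (3, 4), "U": (3, 3)}
params = {name: rng.standard_normal(s) * 0.1 for name, s in shapes.items()}

h = np.zeros(3)
for word_vec in rng.standard_normal((5, 4)):  # a 5-word "sentence"
    h = gru_cell(word_vec, h, params)
print(h.shape)  # → (3,)
```

Running this over a real corpus would of course require learned embeddings, an output softmax over the vocabulary, and training by backpropagation through time; the sketch only shows the recurrence itself.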

Final Thoughts

Traditional machine translation models are Bayesian. At a very broad scope, they align the source-language corpus with the target-language corpus (usually at the sentence or paragraph level); after many repetitions of the alignment process, each block (sentence or paragraph) has many possible translations, and the best hypothesis is then searched for using Bayes’ Theorem.
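The final search step can be written as the noisy-channel formulation (using $f$ for the source sentence and $e$ for the candidate translation, a common convention not defined in this post):

```latex
e^{*} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)
```

Here $P(f \mid e)$ is the translation model learned from the aligned blocks and $P(e)$ is a target-language model.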

EDIT: The papers cited in this post are from 2015 and earlier. As of March 2017 (when this post was written), we’ve been told that neural systems are already used in production, replacing the Bayesian systems mentioned above.

[1] There are other reasons. For example, the RAM requirement only scales with the number of words.