Recurrent Neural Networks for Language Translation
Tools that allow any person to communicate with any other person truly make the world a better place. The Rosetta Stone was the first of such tools, which later evolved into dictionaries and eventually into sophisticated systems such as a language translator that you’ve probably used before, Google Translate. Deep Learning will soon change how these systems work, and the models that will enable this have all kinds of applications in NLP, even outside the realm of machine translation (such as building an opinion generator, which part of the AI-Society team will actually hack on, so stay tuned).
Let’s first talk about recurrent neural network (RNN) based language models. Yoshua Bengio proposed using statistical modeling based on artificial neural networks to compute the probability of a sequence of words occurring. This approach proved successful; however, feedforward neural networks cannot receive variable-length sequences as input, which limits the power of the model. Since RNNs allow variable-length sequences both as input and as output, they are naturally the next step in statistical language modeling. The RNN architecture is presented in the diagram below.
In this model we are given a sequence of word vectors as input. The network is unrolled over t time-steps, which are equivalent to the number of hidden layers, and each layer has neurons (each performing a linear matrix operation on its inputs followed by a non-linear operation). At each time-step, the output of the previous step and the next word vector in the text corpus are passed as inputs to the hidden layer to generate the prediction for the sequence (conditioning the neural network on all previous words).
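The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code behind the post: the names `rnn_forward`, `W_hh`, `W_hx` and `W_s`, the choice of `tanh` as the non-linearity, and the softmax output layer are all assumptions made for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def rnn_forward(word_vectors, W_hh, W_hx, W_s, h0):
    """Run a simple RNN over a sequence of word vectors.

    Returns one probability distribution over the vocabulary per
    time-step, plus the final hidden state. All parameter names are
    hypothetical and chosen only for this sketch.
    """
    h = h0
    predictions = []
    for x_t in word_vectors:
        # Linear operation on the previous hidden state and the current
        # word vector, followed by a non-linearity (tanh is assumed here).
        h = np.tanh(W_hh @ h + W_hx @ x_t)
        # Distribution over the next word, conditioned on all words so far.
        predictions.append(softmax(W_s @ h))
    return np.array(predictions), h

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embed_dim, hidden_dim, vocab_size, steps = 4, 3, 5, 6
    xs = rng.normal(size=(steps, embed_dim))
    preds, h_last = rnn_forward(
        xs,
        rng.normal(size=(hidden_dim, hidden_dim)),
        rng.normal(size=(hidden_dim, embed_dim)),
        rng.normal(size=(vocab_size, hidden_dim)),
        np.zeros(hidden_dim),
    )
    print(preds.shape)
```

Because the same weights are reused at every time-step, the loop accepts a sequence of any length, which is the property that feedforward models lack.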
These are the basics of RNNs. However, simple RNN architectures have problems, which were explored by Bengio et al. In practice, simple RNNs aren’t able to learn “long-term dependencies”. Let’s analyze the following example, in which we try to predict the last word in a sentence:
“I prefer writing my code in Node JS because I am fluent in ______.”
Gated Recurrent Units
GRUs were introduced by Cho et al. The main idea behind this architecture is to keep memories around so that long-distance dependencies can be captured, and to allow error signals to flow at different strengths depending on the inputs.
Instead of directly computing a hidden layer at the next time-step, a GRU first computes an update gate (which is another layer), taking the current word vector and the previous hidden state as inputs.
Then a reset gate is computed with the same equation but with different weights.
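Written out in a commonly used notation, the two gates look as follows. The symbols are assumptions for illustration and are not taken from this post: x_t is the current word vector, h_{t-1} is the previous hidden state, \sigma is the logistic sigmoid, and W and U are the weight matrices of each gate.

```latex
z_t = \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right) \quad \text{(update gate)}

r_t = \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right) \quad \text{(reset gate)}
```

Both gates have the same functional form and differ only in their weights, so each entry of z_t and r_t lies between 0 and 1.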
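Next, a new memory is computed from the current word vector and the previous hidden state scaled by the reset gate. If the reset gate is 0, the previous memory is ignored and only the new word information is stored in the memory (reset).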
Finally, the memory at the current time-step is computed by combining the new memory with the memory from the previous time-step, weighted by the update gate.
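A single GRU step can be sketched in NumPy as below. This is a minimal version under stated assumptions: the names `gru_step` and `init_gru_params` and the parameter keys are hypothetical, bias terms are omitted, and the final combination follows the convention in which an update gate close to 1 keeps the previous memory (some papers write it the other way around).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru_params(embed_dim, hidden_dim, seed=0):
    """Small random weights; the key names are hypothetical."""
    rng = np.random.default_rng(seed)
    shapes = {
        "W_z": (hidden_dim, embed_dim), "U_z": (hidden_dim, hidden_dim),
        "W_r": (hidden_dim, embed_dim), "U_r": (hidden_dim, hidden_dim),
        "W": (hidden_dim, embed_dim), "U": (hidden_dim, hidden_dim),
    }
    return {name: rng.normal(scale=0.1, size=shape)
            for name, shape in shapes.items()}

def gru_step(x_t, h_prev, p):
    """Compute the next hidden state of a GRU cell (biases omitted)."""
    # Update gate: how much of the previous memory to carry forward.
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)
    # Reset gate: same equation as the update gate, different weights.
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)
    # New memory: if r is 0, only the current word vector contributes.
    h_tilde = np.tanh(p["W"] @ x_t + r * (p["U"] @ h_prev))
    # Final memory: blend of previous memory and new memory.
    return z * h_prev + (1.0 - z) * h_tilde

if __name__ == "__main__":
    params = init_gru_params(embed_dim=4, hidden_dim=3)
    h = np.zeros(3)
    for x in np.random.default_rng(1).normal(size=(5, 4)):
        h = gru_step(x, h, params)
    print(h)
```

When the update gate saturates at 1, the previous hidden state is copied through unchanged, which is what lets information and error signals survive across many time-steps.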
Now that you understand the architecture of a GRU cell, you can do some really interesting things. For instance, you could train this model and compare the perplexity of an LSTM with that of a GRU. You could also modify this piece of code and build your own language translator using GRUs instead of LSTMs.
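For the comparison mentioned above, perplexity can be computed from the probabilities a model assigns to the correct next words in a held-out text. The helper below is a minimal sketch; the name `perplexity` and its input format are assumptions for illustration.

```python
import numpy as np

def perplexity(target_probs):
    """Perplexity from the probabilities assigned to the correct words.

    target_probs holds, for each position in the evaluation text, the
    probability the model gave to the word that actually occurred.
    """
    target_probs = np.asarray(target_probs, dtype=float)
    # Exponential of the average negative log-likelihood per word.
    return float(np.exp(-np.mean(np.log(target_probs))))
```

Lower is better: a model that always spreads its probability uniformly over V words has perplexity V, so evaluating an LSTM and a GRU on the same text gives directly comparable numbers.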
Traditional machine translation models are Bayesian. Very broadly, they align the source-language corpus with the target-language corpus (usually at the sentence or paragraph level). After many repetitions of the alignment process, each block (sentence or paragraph) has many possible translations, and finally the best hypothesis is searched for using Bayes’ Theorem.
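The search for the best hypothesis is often written in the following form. The symbols are assumptions for illustration: f is the source sentence, e is a candidate translation, p(f | e) comes from the alignment (translation) model and p(e) from a language model of the target language.

```latex
\hat{e} = \arg\max_{e} \, p(e \mid f)
        = \arg\max_{e} \, \frac{p(f \mid e)\, p(e)}{p(f)}
        = \arg\max_{e} \, p(f \mid e)\, p(e)
```

The denominator p(f) can be dropped because it does not depend on the candidate e.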
EDIT: The papers cited in this post are from 2015 and earlier. As of March 2017, when this post was written, we’ve been told that these systems are already used in production, replacing the Bayesian systems mentioned above.
 There are other reasons. For example, the RAM requirement only scales with the number of words.