# Learning phrase representation using RNN Encoder-Decoder for Machine Translation

**Citation**: https://www.aclweb.org/anthology/D14-1179

**Introduction:**

RNN encoder decoder model first proposed by Cho et al in 2014. It uses two recurrent neural network. One encodes the input sequence into a fixed length vector representation and another decodes that representations into another sequence of symbols. They are jointly trained to increase conditional probability of the target sequence given the input sequence. In addition to standard log loss of recurrent neural network using conditional probabilities of phrase pairs computed by RNN encoder-decoder found to improve empirical performance. The paper has shown that Encoder-Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.

Deep neural network has shown great success in object recognition tasks. Furthermore it has been successful in NLP tasks such as language modeling (Bengio et al), word embeddings extraction (Mikolov et al) and paraphrase detection (Socher et al).

Encoder-Decoder model maps the vector representation back to a variable length target sequence. Additionally, it proposes a sophisticated hidden unit in order to improve both the memory capacity and the ease of training.

The paper described that it uses a novel hidden unit is empirically evaluated on the task of translating from English to French. We train the model to learn the translation probabilities of an English phrase to a corresponding French phrase. The model is then used to part of a standard phrase based SMT system by scoring phrase pairs in the phrase table. Encoder-Decoder model is better in capturing linguistic regularities in the phrase table, indirectly explaining the quantitative improvement in the overall translation performance. It also learns continuous space representation of the phrase that preserves both the semantic and syntactic structure of the phrase.

**RNN Encoder-Decoder:**

RNN is a neural network that consists of a hidden state h and an optional output y which operates on a variable length sequence x = (x1, …..,xT). At each time step t, the hidden state h(t) of the RNN is updated by

h(t) = f (h(t−1) , x(t)) 1

Where f is a non-linear activation function. F may be as simple as an element wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Horchreiter and Schmidhuber, 1997).

The output at each time t is the conditional distribution p(xt|xt-1, ….x1) For example, a multinomial distribution (1-of-K coding) can be output using a softmax activation function.

For all possible symbols j = 1, ….,K where w(j) are the weight of the rows of a weight matrix W. by combining these for all symbols we compute probability of sequence x as

For this distribution we can sample output sequence conditioned on input sequences.

So, an encoder RNN (or its other variants such as LSTM, GRU) learns a fixed length vector representation from variable length sequence input (its vectorized equivalent such as embeddings). Decoder RNN (similar to encoder architecture) converts (or decodes) a variable length sequence output from the fixed length vector representation as explained in the picture bellow.

Here we can make a note that sequence length of encoder input and decoder output (T, T’) could be different as in the case of machine translation happens more frequently where length of a sentence of source language (or word) may not necessary have same length when translated into target language.

At each time step, encoder reads the input at that time step and also add hidden state from previous step. At the end of input sequence it outputs a hidden state that are meant to encode entire input sequence of both semantic and structural regularities in the input sequence. Note that here in this type of process there is no additional feature specific grammars syntax are applied and encoder RNN pretty much encodes the necessary features from the sequence itself. This is quite amazing in comparison with traditional statistical machine translation technique.

Decoder on the other hand generates output y(t) at time step t given the final hidden state from encoder C and also depends on previous hidden state of decoder and previous output y(t-1) unlike RNN shown above (Eq. 1)

h(t) = f(h(t-1), y(t-1), c) and so conditional distribution changes to:

P(yt |y(t−1), y(t−2), . . . , y(1), c) = g(h(t) , y(t−1), c)

Both encoder and decoder RNN are jointly trained with objective function:

Where theta is the set of model parameters and (xn, yn) represents input and output sequence. Typically as standard RNN this is differentiable and optimized using Gradient Descent algorithm.

Encoder-decoder can be used for predicting a next sequence given a sequence as output and also generating a score for a given input-output sequence combination.

**GRU:**

Cho also introduced GRU — Gated Recurrent Unit which is a special case of LSTM architecture and much simpler. The paper of encoder-decoder model proposed GRU to be used as RNN unit for both encoder and decoder architecture.

They proposed a reset gate and an update gate. Reset gate intuitively helping to forget what is not important and ensure to pay attention to what is current input.

Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

As we see from the equation of the reset gate if reset gate vector is zero then Wr . h(t-1) get to become zero and in that case it ignores past sequence and only focus on current input Xt. This allows to ignore information that are irrelevant from past sequence. For example, in a movie review starts with describing the plot and lastly comment that it was a boring movie then all the past information that talks about plot of the movie wasn’t informative to classify this review as negative, only information that is sufficient is where it says movie is boring.

Update gate whereas allows to pay attention to all past sequences and balances reset gate’s effect. It ensure to pay attention no matter how useful past sequences are. For example, it will take care a situation where movie review starts with saying “I love this movie” and then starts to get into movie details. The first sentence was good enough and needed to be memorized. Update gate just does that and as Z gets to be close to 1 network can just copy hidden vectors from past and also reduces possibility of vanishing gradient remarkably. Units with short term dependencies has reset gate very active and units with long term dependencies makes update gate very active. Also these effect of reset gate and update gate are learned jointly using back propagation and no need any special processing.

Encoder-decoder model has been proved more robust and scalable compared to statistical machine translation model such as CSLM. They have compared RNN and also RNN + CSLM model and compared result. Paper has found that best phrase translation was found using both CSLM and phrase scores form RNN encoder-decoder model. RNN encoder-decoder model able to learn linguistic regularities whereas standard STM model learns statistics of word co-occurrences in the corpus and relied on the statistics.

**Conclusion:**

RNN encoder-decoder model able to learn from arbitrary length sequence and generate target sequence of arbitrary length. The model is very successful either predicting a target sequence based on a input sequence or output a probability score given a input and output pair. When tried to score for machine translation task out of phrase mapping table model demonstrated capability to understand linguistic regularities and able to propose well-formed target phrases. Current state of the art language model and many NLP tasks are based on the premise of this encoder-decoder model including machine translation and sentiment analysis.