A summary of “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”

Dibu Benjamin
Machine Intelligence and Deep Learning
May 2, 2022

This report is based on the “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation” research paper. It was written as the EEL 6812 course project report.

Research Purpose

This paper proposes a novel neural network architecture called the RNN Encoder-Decoder, which can be used as a part of a conventional phrase-based SMT system. In simple words, it is part of a system for translating a sentence from one language to another.

Overview
In the field of statistical machine translation (SMT), deep neural networks have begun to show promising results. This study continues the research on employing neural networks for SMT by focusing on a neural network design that can be used as a component of the traditional phrase-based SMT system. The proposed architecture, dubbed the RNN Encoder-Decoder, is made up of two recurrent neural networks (RNNs) that operate as an encoder-decoder pair.

  1. Encoder: First RNN, maps a variable-length source sequence to a fixed-length vector.
  2. Decoder: Second RNN, maps the vector representation back to a variable-length target sequence.

The proposed encoder and decoder are trained jointly to maximize the probability of the target sequence given a source sequence. In addition, a new hidden unit architecture is proposed to increase the memory capacity of the network.

Recurrent Neural Network
We must first understand a few key concepts of RNNs to fully understand the architecture of the RNN Encoder-Decoder.

RNNs work very well on sequential data. For example, suppose we have a sentence and need to predict whether it is positive or negative. First, we need to convert the text into vectors; then we can apply an ML algorithm to decide whether the sentence is positive or negative. However, the sequence information is discarded in this approach. Now consider the task of deciding whether a transaction is fraudulent: in that case, changing the order of the inputs does not affect the output.

In the case of a translator, however, changing the order seriously damages the output. For example, if instead of “I ate sushi on Sunday” we change the input sequence to “I ate Sunday on sushi”, the output translation will also be something else entirely. So here the order of the data is very important. This is one of the main reasons for using an RNN instead of a plain ANN on sequential data.

The three major problems with using a plain ANN for sequence problems are:

  1. Fixed size of the input/output layers (variable-length sequences cannot be handled).
  2. Too much computation.
  3. No parameter sharing.

RNN Architecture:
If you start watching a movie after the beginning, say half an hour in, you probably will not understand what is happening. We need to understand the previous events as well. Humans always connect things with past events; we do not think from scratch every second. Traditional neural networks cannot do this, and it seems like a major shortcoming. RNNs address this issue: an RNN is a special kind of network with loops between the input and output.

Recurrent Neural Network General Structure

Here Xt is the input, A is the hidden layer, and ht is the output. Along with this, we also get an output at each time step.

RNN hidden layer at different time steps

Let me explain this differently. The image above does not show four different hidden layers; it is a single hidden layer unrolled over 4 time steps. Suppose we have a sentence of four words. The RNN processes one word at each time step. What happens in the first time step? X11 is passed to the network as input, and the hidden layer produces an output that is fed into the next time step; whatever the output of the previous time step was is also fed back into the layer. For the first time step we need some previous activation O0 as well; let us say it is a vector of all zeros. The output of each time step can then be represented as follows:

  • O1 = f(X11·W + O0·W1)
  • O2 = f(X12·W + O1·W1)
  • where f is an activation function,
  • and W, W1 are the input and recurrent weights.

So the current output also contains the previous output; that is how the sequence information is kept. Formally, a recurrent neural network (RNN) is a neural network that consists of a hidden state h and an optional output y and operates on a variable-length sequence x = (x1, …, xT). At each time step t, the hidden state h<t> of the RNN is updated by h<t> = f(h<t−1>, xt), where f is a non-linear activation function.
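To make the recurrence concrete, here is a minimal NumPy sketch of the update h<t> = f(h<t−1>, xt) with tanh as the non-linearity f; the dimensions, random weights, and absence of an output layer are illustrative simplifications, not the paper's setup.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One update of a vanilla RNN: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy sizes (hypothetical): a 4-word sentence, 8-dim word vectors, 16 hidden units.
rng = np.random.default_rng(0)
seq = rng.normal(size=(4, 8))              # x_1 ... x_4
W_xh = rng.normal(size=(16, 8)) * 0.1      # input-to-hidden weights
W_hh = rng.normal(size=(16, 16)) * 0.1     # hidden-to-hidden (recurrent) weights
b_h = np.zeros(16)

h = np.zeros(16)                           # h<0>: the all-zeros initial state
for x_t in seq:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # h<t> = f(h<t-1>, x_t)
print(h.shape)                             # (16,)
```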

One of the advantages of RNNs is the possibility of connecting earlier information to the current task. Sometimes this is enough, but when the event we need to remember lies far in the past, it is not. Consider trying to predict the last word in the text “I grew up in France… I speak…”. The last word should be French, but to predict this we need the information “France”. The gap between the relevant information and the point where it is needed is very large, and it is entirely possible that an RNN becomes unable to learn to connect the information as the gap grows.

Long Short-Term Memory Network (LSTM)
Long Short-Term Memory networks are a special kind of RNN, capable of learning long-term dependencies. All recurrent neural networks are made up of a chain of repeated neural network modules. In ordinary RNNs this repeating module has a relatively basic structure, such as a single tanh layer.

LSTMs also have this chain-like structure, but the repeating module is different: there are four neural network layers in an LSTM instead of a single one as in a plain RNN.

General Structure of LSTM

The cell state (the upper line) and its many gates are at the heart of LSTMs.

RNN with cell state

The cell state is kind of like a conveyor belt that carries relevant information all the way down the chain. The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates. During training, the gates learn which information is important to preserve or forget. Gates are a mechanism to selectively allow information to pass through. They are made up of a sigmoid neural net layer plus a pointwise multiplication operation. The sigmoid layer produces values between zero and one, indicating how much of each component should be allowed to pass: a value of zero means “let nothing through”, whereas a value of one means “let everything through”.
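As a tiny illustration of this gating idea (toy numbers, not from the paper), multiplying a vector pointwise by a sigmoid output suppresses the components where the gate is near zero and passes the components where it is near one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A hypothetical 4-dimensional cell state and a gate vector in (0, 1).
cell_state = np.array([ 2.0, -1.0, 0.5, 3.0])
gate       = sigmoid(np.array([-6.0, 6.0, 0.0, 2.0]))  # ~ [0.00, 1.00, 0.50, 0.88]

# Pointwise multiplication: values near 0 block information, values near 1 let it through.
print(gate * cell_state)   # ~ [ 0.005, -0.998, 0.25, 2.64]
```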

The LSTM has three gates: the forget gate, the input gate, and the output gate.

  • Forget gate: This gate decides whether a piece of information should be discarded or kept. The previous hidden state and the current input are passed through a sigmoid function. If the resulting value is close to zero, the information should be forgotten; if it is close to one, it should be remembered.
Forget gate
  • Input gate: The input gate is used to update the cell state. This step determines what new information will be stored in the cell state. The previous hidden state and the current input first pass through a sigmoid function, which decides which values to update (0 means not important, 1 means important). The same information is also passed through a tanh function to produce candidate values, and the two results are combined to create an update to the state.
Input gate
  • Output gate: decides what we are going to output. This output is based on the cell state, but is a filtered version of it. First, we run a sigmoid layer which decides which parts of the cell state we are going to output. Then we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to. A minimal sketch combining the three gates follows this list.
Output gate
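Putting the three gates together, here is a minimal NumPy sketch of one LSTM step; the single stacked weight matrix and the toy sizes are assumptions for illustration, not the exact formulation of Hochreiter and Schmidhuber [4].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the standard gate equations (a sketch).
    W maps the concatenation [h_prev; x_t] to the four stacked pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    n = h_prev.shape[0]
    f = sigmoid(z[0*n:1*n])          # forget gate: what to drop from the cell state
    i = sigmoid(z[1*n:2*n])          # input gate: what new information to write
    g = np.tanh( z[2*n:3*n])         # candidate values for the cell state
    o = sigmoid(z[3*n:4*n])          # output gate: which part of the cell to expose
    c_t = f * c_prev + i * g         # update the cell state (the "conveyor belt")
    h_t = o * np.tanh(c_t)           # filtered output
    return h_t, c_t

# Toy sizes: 8-dim input, 16-dim hidden and cell state.
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * 16, 16 + 8)) * 0.1
b = np.zeros(4 * 16)
h, c = np.zeros(16), np.zeros(16)
h, c = lstm_step(rng.normal(size=8), h, c, W, b)
```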

In this paper, they have introduced a new hidden unit, now known as the GRU (Gated Recurrent Unit).

GRU
The GRU is a unit motivated by the LSTM, but it is simpler to compute and implement.

As we can see, the main difference between the LSTM and the GRU is that the GRU has only two gates: an update gate and a reset gate. The GRU also merges the cell state and the hidden state.

Graphical representation

Here r is the Reset gate and z is the update gate.

LSTM Gates:

Forget gate
Input gate
Output gate

GRU gates

  • Update gate: this performs the roles of the forget and input gates of the LSTM. The update gate controls how much information from the previous hidden state is carried over to the current hidden state.
  • Reset gate: the reset gate decides how much past information to forget. If its value is close to zero, the hidden state is forced to ignore the previous hidden state and reset with the current input only. A minimal sketch of one GRU step follows below.
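Here is a minimal sketch of one step of this gated unit, using the paper's update rule h<t> = z ⊙ h<t−1> + (1 − z) ⊙ h̃<t>; the stacked weight layout, omitted biases, and toy sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One step of a gated (GRU-style) hidden unit; biases omitted for brevity."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                       # update gate
    r = sigmoid(W_r @ hx)                                       # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate activation
    return z * h_prev + (1.0 - z) * h_tilde                     # new hidden state

# Toy sizes: 8-dim input, 16-dim hidden state, random untrained weights.
rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W_z = rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
W_r = rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
W_h = rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
h = gru_step(rng.normal(size=n_in), np.zeros(n_hid), W_z, W_r, W_h)
```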

RNN Encoder-Decoder
As I explained before this is a combination of two RNN units.

General Architecture of Encoder-decoder
Graphical Representation

In the encoder, no output is produced at each hidden state; instead, the last hidden state of the encoder is used. That is, the hidden state of the encoder after reading the whole input is a summary c of the input sequence, and this summary is fed as input to the decoder. This is not the same as the RNN we discussed before: here both the decoder output and its current hidden state are conditioned on the previous output and on the summary. So the hidden state of the decoder at time t can be written as h<t> = f(h<t−1>, yt−1, c).

Similarly, the conditional probability of the next symbol is P(yt | yt−1, …, y1, c) = g(h<t>, yt−1, c), where g produces valid probabilities (for example, with a softmax).

Once trained, the RNN Encoder-Decoder may be utilized in two ways. On the one hand, the model may be used to generate a target sequence from an input sequence. On the other hand, it may be used to score a given pair of input and output sequences, the score being simply the probability of the output given the input.
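To illustrate the scoring use case, here is a minimal sketch that compresses a source sequence into a summary vector c and accumulates log p(target | source) one symbol at a time. The plain tanh recurrences, the simplified output layer, the toy vocabulary, and the random weights are illustrative stand-ins for the paper's gated units and trained parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy sizes and random (untrained) parameters, purely for illustration.
rng = np.random.default_rng(0)
n_emb, n_hid, n_vocab = 8, 16, 20
E_src = rng.normal(size=(n_vocab, n_emb)) * 0.1            # source word embeddings
E_tgt = rng.normal(size=(n_vocab, n_emb)) * 0.1            # target word embeddings
W_enc = rng.normal(size=(n_hid, n_hid + n_emb)) * 0.1
W_dec = rng.normal(size=(n_hid, n_hid + n_emb + n_hid)) * 0.1
W_out = rng.normal(size=(n_vocab, n_hid)) * 0.1

def score(src_ids, tgt_ids):
    # Encoder: read the whole source; the final hidden state is the summary c.
    h = np.zeros(n_hid)
    for i in src_ids:
        h = np.tanh(W_enc @ np.concatenate([h, E_src[i]]))
    c = h
    # Decoder: each step is conditioned on h<t-1>, the previous symbol, and c;
    # accumulate log P(y_t | y_<t, c) with a simplified softmax output layer.
    h, y_prev, logp = np.zeros(n_hid), np.zeros(n_emb), 0.0
    for t in tgt_ids:
        h = np.tanh(W_dec @ np.concatenate([h, y_prev, c]))
        p = softmax(W_out @ h)
        logp += np.log(p[t])
        y_prev = E_tgt[t]
    return logp                                            # log p(target | source)

print(score([3, 7, 1], [5, 2]))
```

In the paper, it is exactly this kind of probability that is used as the score of a phrase pair.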

Statistical Machine Translation
Statistical machine translation (SMT) is a type of machine translation that employs huge amounts of bilingual data to determine the most likely translation for a given input. Statistical machine translation systems learn to translate by examining statistical relationships between original texts and their human translations. The system’s purpose is to find a translation f of a source sentence e that maximizes

p(f | e) ∝ p(e | f) p(f),

where p(e | f) is the translation model and p(f) is the language model. In practice, most SMT systems model log p(f | e) as a log-linear model with additional features and corresponding weights.

The translation model log p(e | f) is factorized into the translation probabilities of matching phrases in the source and target sentences. These probabilities are treated as additional features in the log-linear model and weighted so as to maximize the BLEU score. The data used to learn such a model is typically a phrase table containing phrase translations and their probabilities. In this paper, the scoring process is set up by training the RNN Encoder-Decoder on these pairs of phrases and then feeding its score as an additional feature into the SMT decoder (see the sketch below). When training the RNN Encoder-Decoder, they disregard the (normalized) frequencies of each phrase pair in the original corpora. Once the RNN Encoder-Decoder has been trained, a new score for each phrase pair is added to the existing phrase table. This enables the new scores to be included in the existing tuning procedure with little additional computational complexity.
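As a rough illustration of “feeding the score as an additional feature”, the sketch below adds a hypothetical RNN Encoder-Decoder log-probability to a phrase pair's feature set in a log-linear model. All feature names, values, and weights here are made up; a real system such as Moses has its own feature set and its own tuning procedure for the weights.

```python
# Hypothetical log-domain features of one phrase pair in the phrase table.
phrase_pair_features = {
    "log_p_e_given_f": -2.3,    # translation model probability (log)
    "log_p_f_given_e": -1.9,    # inverse translation probability (log)
    "rnn_encdec_score": -1.1,   # new feature: log-probability from the RNN Encoder-Decoder
}

# Hypothetical weights; in practice these are tuned to maximize BLEU on a dev set.
weights = {
    "log_p_e_given_f": 0.30,
    "log_p_f_given_e": 0.20,
    "rnn_encdec_score": 0.25,
}

# Log-linear score of the phrase pair: a weighted sum of its features.
log_linear_score = sum(weights[k] * v for k, v in phrase_pair_features.items())
print(log_linear_score)
```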

In this paper, they also put forward the possibility of moving completely to a neural system: the proposed RNN Encoder-Decoder could totally replace the existing phrase table. In that case, the RNN Encoder-Decoder would need to generate a list of (good) target phrases for a given source phrase. This requires a lot of computation, so they did not implement it.

Experiment
In this paper, they evaluate the approach on the English/French translation task of the WMT’14 workshop. Some of the available resources are listed below.

  • The bilingual corpora include Europarl (61M words), news commentary (5.5M), UN (421M), and two crawled corpora of 90M and 780M words respectively. The last two corpora are quite noisy.
  • To train the French language model, about 712M words of crawled newspaper material are available in addition to the target side of the bitexts.
  • A subset of 418M words is selected out of more than 2G words for language modeling and a subset of 348M out of 850M words is selected for training the RNN Encoder-Decoder.

The experiment employed an RNN Encoder-Decoder with 1000 hidden units, using the proposed gated unit at both the encoder and the decoder. The activation function is the hyperbolic tangent. Translation quality is evaluated with the BLEU score, which is essentially an n-gram (token) matching metric; a simplified sketch of its core idea follows.
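Since BLEU is central to the evaluation, here is a simplified sketch of its main ingredient, clipped n-gram precision between a candidate and a reference; real BLEU combines precisions for n = 1..4 over a whole corpus, supports multiple references, and multiplies by a brevity penalty.

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the core ingredient of BLEU (simplified sketch)."""
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    ref  = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
    # Each candidate n-gram counts only up to the number of times it appears in the reference.
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(modified_ngram_precision(candidate, reference, 1))  # 5/6 unigrams match
print(modified_ngram_precision(candidate, reference, 2))  # 3/5 bigrams match
```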

Quantitative Analysis
In this paper, they use different combinations of approaches and compute the BLEU score on the development and test sets.

1. Baseline configuration

2. Baseline + RNN

3. Baseline + CSLM + RNN

4. Baseline + CSLM + RNN + Word penalty

The first one is the baseline. The baseline phrase-based SMT system was built using Moses, a free, open-source SMT engine used to train a statistical model for translating text from a source language to a target language. The second setup checks how the SMT model does with the additional score from the RNN Encoder-Decoder; these scores are basically the probabilities from the encoder-decoder architecture. The third setup adds a continuous space language model (CSLM) on top of the previous one; this configuration reveals whether the contributions of neural networks in different parts of the SMT system add up or are redundant. Finally, another setup adds a word penalty on top. The results for all these setups are shown below.

The evaluation results show that performance improves when the features computed by the neural networks are added alongside the baseline.

Qualitative Analysis
The phrase pair scores computed by the RNN Encoder-Decoder are compared with the corresponding p(f | e) from the translation model to see where the performance increase comes from.

They concentrate their qualitative analysis on phrase pairs whose source phrase is long (more than 3 words) and frequent. For each such source phrase, they look at the target phrases that received high scores either from the translation probability p(f | e) or from the RNN Encoder-Decoder. The procedure is then repeated for phrase pairs whose source phrase is long but rare in the corpus.

The list shows the top three target phrases for each source phrase, for both the translation model and the proposed RNN model. In almost all cases, the target phrase selected by the proposed RNN Encoder-Decoder is close to the actual translation. One more thing we can conclude from the list is that the RNN model tends to suggest shorter phrases.

The translation model and the RNN Encoder-Decoder scored many phrase pairs similarly, but just as many phrase pairs were scored significantly differently.

phrase pairs according to their scores

This might be due to the suggested method of training the RNN Encoder-Decoder on a set of unique phrase pairs, which prevents the RNN Encoder-Decoder from merely learning the frequency of the phrase pairs from the corpus.

Conclusion
The proposed RNN Encoder-Decoder model can be used in two ways: to score a pair of input and output sequences, or to generate a target sequence from a source sequence. The paper also proposes a new hidden unit with only two gates, a reset gate and an update gate, which help each hidden unit decide how much to remember or forget when reading/generating a sequence. The proposed model was evaluated within an SMT system by scoring each phrase pair in the phrase table with the RNN model. The results of this comparison show that the new model captures linguistic regularities in phrase pairs effectively and can be used to suggest well-formed target phrases. In terms of BLEU scores, the RNN Encoder-Decoder was shown to improve overall translation performance.

Reference

[1] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of EMNLP 2014.

[2] https://www.youtube.com/watch?v=rdkIOM78ZPk&t=356s

[3] https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

[4] [Hochreiter and Schmidhuber1997] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
