Seq2Seq Models : French-to-English translation using an encoder-decoder model with attention.

Hardik Vagadia · Analytics Vidhya · Jul 6, 2020

A TensorFlow implementation of machine translation with an attention mechanism.

Table of contents :

  1. Introduction to seq-to-seq models.
  2. Encoder.
  3. Decoder.
  4. Attention.
  5. Code walk-through.
  6. Evaluating performance on test data.
  7. Conclusion.
  8. References.

1- An introduction to Sequence-to-Sequence models :

Sequence-to-Sequence learning is all about training models to convert sequences from one domain into another domain. One important thing to note here is that the two sequences may not be of the same length. A typical Seq2Seq model consists of an encoder and a decoder, which are themselves two separate neural networks combined into a single giant network. Both the encoder and the decoder are typically LSTM or GRU models.

Some applications of Seq2Seq models are neural machine translation, image captioning, speech recognition, chat-bots, time-series forecasting, etc.

The job of the encoder network is to understand the input sequence and create a smaller-dimensional representation of it, which is in turn forwarded to the decoder network that generates the output.

The input to the encoder may be an encoded sentence (in the case of neural machine translation), image features (in the case of image captioning) or even sound waves (in the case of speech recognition).

Figures : Neural Machine Translation, Speech Recognition and Image Captioning

In this blog, I will discuss the encoder-decoder model for neural machine translation. I will train the model to translate French sentences into English.

2- Encoder :

Let’s understand the encoder architecture. Assume that our input sentence has ‘n’ words. For simplicity, we assume that there is only a single sentence in our corpus.

  • French : “Je ne parle pas anglais”
  • Translation : “I do not speak English”
  • Input Vocabulary : {Je, ne, parle, pas, anglais} (5 unique words).
Encoder Architecture
  • Xi → Since we are going to use word-level encoding, the input at each time step will be a single word of the sentence. This means X1 = ‘Je’, X2 = ‘ne’, and so on, up to X5 = ‘anglais’. If we were using a character-level encoder-decoder model, then the input at each time step would have been a single character: X1 = ‘J’, X2 = ‘e’, X3 = ‘n’, and so on. Each word is represented in the form of a vector: each word is replaced by its word index in the corpus vocabulary. More frequent words get smaller word indices than less frequent words.
  • ‘h0’ and ‘c0’ → The initial hidden state and context vectors, which are (generally) all zeros and are fed to the encoder at the 0th time step. After this, we start feeding in the input words.
  • ‘hi’ and ‘ci’ → The hidden state and context vectors after time step i. In simple terms, these vectors represent what the encoder has seen up to this time step. For example, h3 and c3 will remember that the network has seen “Je ne parle” so far. The size of each of these vectors is equal to the number of LSTM/GRU units. The states obtained after the last time step are fed into the decoder as its initial states.
  • ‘Yi’ → The output at time step i. It is a probability distribution over the entire vocabulary, generated using the softmax activation function. We do not require the outputs of the encoder network, so we discard them. The only encoder outputs we care about are the hidden/context vectors.

Note : The above diagram shows a single LSTM/GRU cell unfolded along the time axis, i.e. it is one LSTM/GRU cell that takes a single word at each time step. We can have multiple such cells in our encoder/decoder networks.
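To make this concrete, here is a minimal tf.keras sketch of such an encoder. It assumes a GRU cell (so there is a single state vector h; an LSTM would additionally carry the context vector c), and the vocabulary size, embedding dimension and number of units are hypothetical values chosen only for illustration:

```python
import tensorflow as tf

vocab_size = 5000      # size of the French (input) vocabulary (illustrative)
embedding_dim = 256    # dimension of the word vectors
enc_units = 1024       # number of GRU units, i.e. the size of each hidden state hi

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units):
        super().__init__()
        # Maps each word index Xi to a dense vector.
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_sequences=True keeps the hidden state after every time step
        # (needed later by the attention layer); return_state=True returns the
        # final state that will initialise the decoder.
        self.gru = tf.keras.layers.GRU(enc_units,
                                       return_sequences=True,
                                       return_state=True)

    def call(self, x, initial_state=None):
        x = self.embedding(x)                                 # (batch, seq_len, embedding_dim)
        output, state = self.gru(x, initial_state=initial_state)
        return output, state                                  # (batch, seq_len, units), (batch, units)
```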

3- Decoder :

After learning about the encoder, let’s now move on to the other part: the decoder network. Unlike the encoder, the decoder behaves differently in the training and inference phases. Also, we need to add two special tokens to the output sentence, for the reasons explained below. These tokens are “<start>” (at the beginning of the string) and “<end>” (at the end of the string).

a). Decoder in training phase :

Note that the final states of the encoder are set as the initial states of the decoder. At the first time step, we provide ‘<start>’ as input so that the decoder starts generating the next token (the first word of the English sentence). We use a technique called “Teacher Forcing”, in which we feed the actual output (and not the predicted output) from the previous time step as input to the current time step. After inputting the last word of the actual translation, we make our decoder learn to predict ‘<end>’, denoting the end of the translation. This ‘<end>’ token acts as a stopping condition during the inference stage.

  • The entire training of the (encoder + decoder) network can be summarized in the below diagram :
Entire encoder-decoder training network
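To make teacher forcing concrete, here is a tiny sketch (using the example translation from above) of how the decoder inputs and targets are built by shifting the target sentence by one position; the variable names are purely illustrative:

```python
# Teacher forcing: at every time step the decoder receives the *actual*
# previous target token, not its own prediction from the previous step.
target = ["<start>", "I", "do", "not", "speak", "English", "<end>"]

decoder_inputs  = target[:-1]   # <start> I do not speak English
decoder_targets = target[1:]    # I do not speak English <end>
```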

b). Decoder in inference phase :

  • The only difference between the decoder in the training and inference phases is that, in the inference phase, the predicted output (and not the actual output, as in the training phase) from the previous time step is fed as input to the current time step. Everything else is the same as in the training phase.
  • The entire inference process is summarized in the below diagram :
Inference Network
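A minimal greedy-decoding sketch of this inference loop is shown below. The helper names (encoder, decoder, targ_tokenizer, max_len_output) are assumptions, and the decoder is assumed to return a probability distribution over the vocabulary together with its new state:

```python
import tensorflow as tf

def translate(input_tensor):
    # Encoder outputs are discarded; only the final state is kept.
    _, enc_state = encoder(input_tensor)
    dec_state = enc_state                      # encoder final state -> decoder initial state
    dec_input = tf.expand_dims([targ_tokenizer.word_index['<start>']], 0)
    result = []
    for _ in range(max_len_output):
        predictions, dec_state = decoder(dec_input, dec_state)
        predicted_id = int(tf.argmax(predictions[0]))
        word = targ_tokenizer.index_word[predicted_id]
        if word == '<end>':                    # stopping condition learned during training
            break
        result.append(word)
        # Unlike teacher forcing, the *predicted* token is fed back in.
        dec_input = tf.expand_dims([predicted_id], 0)
    return ' '.join(result)
```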

Attention!

  • Everything discussed until now was a simple encoder-decoder model without the attention mechanism.
  • A major drawback of such models is that they tend to forget the earlier part of the sequence once they have processed further, which is not good for longer sentences. Look at the below diagram :
Sentence Length vs BLEU Score
  • The above diagram is a plot of sentence length vs BLEU score. The first model (RNNsearch-50) employs the attention mechanism while the other three models do not.
  • We can clearly see that as the sentence length increases, the BLEU score decreases for the latter three models while it remains stable for RNNsearch-50. Hence, attention is a very crucial aspect when dealing with longer sequences. Let’s discuss it in this section.

4- How does Attention work?

Let’s understand it step-by-step by taking our previous example :

French : “Je ne parle pas anglais”

Translation : “I do not speak English”

Below diagram shows their corresponding hidden encoder states :

Encoder GRU
  • Getting encoder hidden states : First of all, the hidden states after each time step are obtained from the encoder, as shown in the above diagram.
  • Setting the encoder final state as the decoder initial state : Since we are trying to predict the first word of the output sequence, the decoder does not yet have any internal states of its own. For this reason, we use the final encoder state (h5) as the initial decoder state.
  • Computing scores : Now, using all the encoder states and the current decoder state, we train a simple feed-forward neural network which learns to identify the relevant encoder states by generating high scores for them and lower scores for the irrelevant ones. For example, to predict the word “speak”, the relevant information may be in states h1, h2 and h3, while the remaining states h4 and h5 may be irrelevant. Our feed-forward NN will learn to give high scores to the first three states and lower scores to the last two. Let these scores be [s1, s2, s3, s4, s5] respectively.
Computing Scores
  • Getting attention weights : The scores obtained from the previous stage are fed to the “Softmax” function to obtain the attention weights. Let these weights be e = [e1, e2, e3, e4, e5]. The sum of all these weights is equal to 1, which gives them a nice probabilistic interpretation.
Obtaining attention weights from softmax
  • Computing the context vector : After getting the attention weights, the context vector is computed as below. It will be used by the decoder to predict the next word.

context-vector (cv) = e1*h1 + e2*h2 + e3*h3 + e4*h4 + e5*h5

  • Concatenating the context vector with the previous output : For the first time step we do not have any previous output, so we concatenate ‘<start>’ with the context vector. We then feed the merged vector to the decoder, which uses it to predict the next word.
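Putting these steps together, here is a minimal sketch of a Bahdanau-style (additive) attention layer in tf.keras. The layer computes the scores with a small feed-forward network, turns them into weights with a softmax, and returns the weighted sum of the encoder states as the context vector; the number of units is an illustrative choice:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # transforms the encoder states h1..h5
        self.W2 = tf.keras.layers.Dense(units)   # transforms the current decoder state
        self.V = tf.keras.layers.Dense(1)        # produces one score per encoder state

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, units) -> (batch, 1, units) so that it broadcasts
        # over the time axis of encoder_outputs: (batch, seq_len, units).
        query = tf.expand_dims(decoder_state, 1)
        # Scores s1..s5 from a simple feed-forward network.
        scores = self.V(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(query)))
        # Softmax over the time axis gives the attention weights e1..e5 (they sum to 1).
        attention_weights = tf.nn.softmax(scores, axis=1)
        # Context vector cv = e1*h1 + e2*h2 + ... + e5*h5 (weighted sum over time).
        context_vector = tf.reduce_sum(attention_weights * encoder_outputs, axis=1)
        return context_vector, attention_weights
```

The context vector is then concatenated with the previous output token before being fed to the decoder, exactly as described in the last step above.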

5- Code Walk-through :

  • First of all, like in any other NLP task, we load the text data, perform pre-processing and do a train-test split.
  • As part of cleaning, we remove HTML tags, numbers and unwanted symbols from the text data.
  • Download the French-to-English dataset from this link.
  • Tokenizing and padding.
  • Padding is performed so that the lengths of all sentences are the same (equal to max_len).
  • Creating the data input pipeline with tf.data. Refer to this link to learn more about tf.data.
  • Defining the encoder and decoder architectures.
  • Defining the optimizer and loss function, and setting the checkpoint directory path to save progress while training.
  • And now, finally, the training loop. It uses the teacher forcing concept discussed above for training.
  • It also saves progress every second epoch.
  • Functions for evaluating the test data and plotting the attention weights (a condensed sketch of the main steps follows below).
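Since the original code gists are not reproduced here, the sketch below condenses the tokenizing, tf.data and training-loop steps listed above. Names such as input_tensor, target_tensor, targ_tokenizer, encoder and decoder, as well as the hyper-parameters, are assumptions for illustration; the decoder is assumed to take the previous token, its state and the encoder outputs (for attention) and to return the predictions and the new state:

```python
import tensorflow as tf

# Tokenizing and padding: map words to indices and pad every sentence to the same length.
def tokenize(sentences):
    tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    tokenizer.fit_on_texts(sentences)
    tensor = tokenizer.texts_to_sequences(sentences)
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
    return tensor, tokenizer

# tf.data input pipeline.
BATCH_SIZE = 64
dataset = (tf.data.Dataset.from_tensor_slices((input_tensor, target_tensor))
           .shuffle(len(input_tensor))
           .batch(BATCH_SIZE, drop_remainder=True))

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out padding tokens (index 0) so they do not contribute to the loss.
    mask = tf.cast(tf.not_equal(real, 0), pred.dtype)
    return tf.reduce_mean(loss_object(real, pred) * mask)

@tf.function
def train_step(inp, targ):
    loss = 0.0
    with tf.GradientTape() as tape:
        enc_output, enc_state = encoder(inp)
        dec_state = enc_state
        dec_input = tf.expand_dims(
            [targ_tokenizer.word_index['<start>']] * BATCH_SIZE, 1)
        # Teacher forcing: feed the ground-truth word targ[:, t] as the next
        # decoder input instead of the decoder's own prediction.
        for t in range(1, targ.shape[1]):
            predictions, dec_state = decoder(dec_input, dec_state, enc_output)
            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss / int(targ.shape[1])
```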

6- Evaluating test data :

  • Our translations will be evaluated using the BLEU (Bilingual Evaluation Understudy) score. It is a value between 0 and 1; the closer the BLEU score is to 1, the better. To learn more about it, please refer to this and this link.
  • I trained the model for 6 epochs and the loss was around 0.05. You can see that some translations are perfect while some are not that good. A small example of computing the BLEU score is given below.
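To make the metric concrete, here is a minimal sketch of computing a sentence-level BLEU score with NLTK on a single hypothetical example; the real evaluation loops over the whole test set:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["i", "do", "not", "speak", "english"]]   # ground-truth tokens (list of references)
candidate = ["i", "do", "not", "speak", "english"]     # tokens produced by the model

smooth = SmoothingFunction().method1   # avoids zero scores on very short sentences
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")            # 1.0 here, since the translation matches exactly
```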

7- Conclusion :

  • Translations on the test data are fairly accurate, though some of them are not up to the mark.
  • We can add more data and train for a larger number of epochs to get better translations.
  • We can also try different attention score functions (the dot and general score functions).
  • Using an LSTM instead of a GRU, with a bidirectional wrapper, can also improve the translations greatly.

8- References :

  1. https://arxiv.org/pdf/1409.0473.pdf
  2. https://arxiv.org/abs/1409.3215
  3. https://medium.com/@martin.monperrus/sequence-to-sequence-learning-program-repair-e39dc5c0119b (Using seq-to-seq learning for program repair).
  4. http://www.manythings.org/anki/ (Link for various datasets)
  5. Keras tutorial for seq-to-seq learning.
  6. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention (Tensorflow attention)
  7. https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39
