Neural Machine Translation and the Need for an Attention Mechanism

Utkarsh Dixit · Published in Analytics Vidhya · Oct 10, 2019 · 12 min read

Table of Contents:

1-Introduction

2-Encoder Architecture

3-Decoder Architecture

4-The Inference Stage

5-The Problem

6-What is Attention?

7-Implementing different types of Attention mechanisms

8-The drawback of attention

9-References

1. Introduction:

Sequence-to-sequence learning (Seq2Seq) is all about models that take a sequence as input and output a sequence as well. There are many examples and applications of this, but today I will focus on one specific application: machine translation, for example from English to Hindi.

There are many approaches you can follow to solve this particular task, but the state-of-the-art technique is the Encoder-Decoder approach (more on this later).

Technically, the Encoder is an RNN unit that takes a sequence as input and encodes it as a state vector, which is then fed to a decoder along with some input to predict the desired output.

Intuitively, the Encoder summarizes the English sequence into a single vector, which is given to the decoder to translate it into the equivalent Hindi output.

Point to be noted: generally, this RNN unit is an LSTM or a GRU.

So our final architecture looks like this.

Instead of describing everything here, let's go through it step by step:

2. Encoder Architecture:

I will go with a single LSTM unit to keep things simple, but in real-world systems deep (stacked) LSTM layers are used so that the model can learn richer representations.

Just for context, an LSTM unit takes three inputs and returns three outputs.

Here X, h0 and c0 are the inputs, and Y, ht and ct are the outputs. The figure above is the rolled version of an LSTM unit. Basically, it takes one word/character at a time and gets unrolled over time. The figure below will make it clear.

The first input is the sequence that we want to translate, and the other two inputs are state vectors: the hidden state and the cell state.
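Just to make this concrete, here is a tiny Keras snippet (the layer size and input dimensions are made up) showing an LSTM that takes a sequence plus two state vectors and returns an output plus two new state vectors:

```python
from tensorflow.keras.layers import Input, LSTM

units = 256
x  = Input(shape=(5, 50))    # X: 5 time steps, each a 50-dim vector
h0 = Input(shape=(units,))   # initial hidden state
c0 = Input(shape=(units,))   # initial cell state

lstm = LSTM(units, return_state=True)
y, ht, ct = lstm(x, initial_state=[h0, c0])   # three outputs: Y, ht, ct
```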

For example, let's take the sequence “This is a good phone”. This sequence contains 5 words, so our encoder LSTM will process one word per time step.

In the diagram above, x1 is the input word, ‘This’, and h0, c0 are the input state vectors, which are initialized randomly at the start. The LSTM outputs three vectors: y1 (the output vector) and h1, c1 (the state vectors).

Intuitively, h1 and c1 contain the information of the word ‘This’, which we fed in at time step t0. The LSTM at time step t1 will take h1 and c1 as input along with the next word in the sequence, ‘is’. Vectors h3, c3 contain information up to the third word, ‘a’. And so on, until the last time step, 5, where we get h5 and c5, which contain information about the whole input sequence.

So now our input sequence “This is a good phone” gets converted into the vectors h5 and c5. We discard the output vectors (y1 to y5) because we don't need them here; we only need the output state vectors, as they contain the information about the given input sequence.

We will now initialize our decoder with these final encoder state vectors, h5 and c5, rather than randomly, as we did with the encoder LSTM. Logically this also makes sense, because we want our decoder not to start from scratch but to have a sense of what the input sequence is.
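Before moving on to the decoder, here is a minimal Keras sketch of this encoder. The vocabulary size and latent dimension are placeholder values, not tied to any particular dataset:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM

latent_dim = 256             # size of the state vectors h and c
num_encoder_tokens = 10000   # assumed English vocabulary size

encoder_inputs = Input(shape=(None,))                        # sequence of word indices
enc_emb = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)

# return_state=True gives us the final h and c; the per-step outputs are discarded
encoder_lstm = LSTM(latent_dim, return_state=True)
_, state_h, state_c = encoder_lstm(enc_emb)

encoder_states = [state_h, state_c]   # h5, c5 in our example
```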

3. Decoder Architecture:

The decoder LSTM has the same architecture as the encoder, but with different inputs and outputs. There are two phases to consider: the training phase and the inference phase. We will touch on the inference phase later; first, let's finish the training phase.

We have the vectors encoded by our LSTM encoder. The h0 and c0 of the decoder are not initialized randomly but with the h5 and c5 that we got from the encoder.

Also, to make things work, we add a _START_ symbol at the start and an _END_ symbol at the end of the target sequence. The final sequence becomes ‘_START_ यह एक अच्छा फोन है _END_’.

Here X1 = _START_ and Y1 = यह, with h0 = h5 of the encoder and c0 = c5 of the encoder. This step returns the state vectors h1, c1, which are fed to the decoder at the next time step, together with the ground-truth word Y1 as its next input. This continues until the model encounters the _END_ symbol. At the last time step we ignore the final state vectors of the decoder (h6, c6) because they are of no use to us; we only need the output Y's.

This technique is also called “Teacher Forcing”.

The entire training architecture (Encoder + Decoder) can be summarized in the below diagram:

Given this architecture, we can now predict outputs at each time step, compute the training loss, and backpropagate the errors through time to update the parameters of the whole network.
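Continuing the encoder sketch above, a teacher-forced decoder and the combined training model could be wired up roughly like this. The target vocabulary size is again a placeholder, and this follows the standard Keras seq2seq recipe rather than any one specific implementation:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

num_decoder_tokens = 12000   # assumed Hindi vocabulary size

# --- Decoder (teacher forcing: the shifted target sequence is an input) ---
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_decoder_tokens, latent_dim)
dec_emb = dec_emb_layer(decoder_inputs)

decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
# Initialize the decoder with the final encoder states instead of random/zero states
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)

decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Training model: (source sequence, shifted target sequence) -> target sequence
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
```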

4. The Inference Stage:

The task of applying a trained model to generate a translation is called inference or, in machine translation, more commonly decoding the sequence.

We have a trained model, so now we can generate predictions for a given input sequence. This step is known as inference; you can think of it as the testing phase, while the steps above were the training phase. At this step, we only have the learned weights and the input sequence that is to be decoded.

The model defined for training has learned weights for this operation, but the structure of the model is not designed to be called recursively to generate one word at a time. For this, we need to design new models for testing our trained model. There are many ways to perform decoding.

The inference stage uses two separate models, an encoder model and a decoder model, which act as stand-alone models for their respective purposes.

The encoder model is simple: it takes the input layer from the encoder of the trained model and outputs the hidden and cell state tensors.

https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/
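Reusing the layers defined in the training sketch earlier, and along the lines of the tutorial linked above, the stand-alone inference encoder is simply a model from the source sequence to the final states:

```python
from tensorflow.keras.models import Model

# Input sequence in, final hidden and cell states out
encoder_model = Model(encoder_inputs, encoder_states)
```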

The decoder model is more involved. It requires three inputs: the hidden and cell states (starting from those produced by the stand-alone encoder model) and the translation generated so far. It has to be called once for every word, i.e. inside a loop, and the loop ends once it produces the _END_ symbol.

https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/
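And a sketch of the stand-alone inference decoder, which reuses the trained decoder layers from the training sketch but exposes the state vectors as explicit inputs and outputs:

```python
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# The previous decoder states are now explicit inputs
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Reuse the trained embedding, LSTM and softmax layers, one step at a time
dec_emb2 = dec_emb_layer(decoder_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(
    dec_emb2, initial_state=decoder_states_inputs)
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2, state_h2, state_c2])
```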

We’ll use these inference models to finally translate a sequence from one language to the other.
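Putting the two inference models together, a greedy decoding loop might look like the sketch below; target_token_index and reverse_target_index are hypothetical word-to-index and index-to-word dictionaries for the target vocabulary:

```python
import numpy as np

def decode_sequence(input_seq, target_token_index, reverse_target_index, max_len=50):
    """Greedily decode one word at a time until _END_ is produced."""
    # Encode the source sentence into the initial decoder states
    states = encoder_model.predict(input_seq)

    # Start decoding from the _START_ token
    target_seq = np.array([[target_token_index['_START_']]])
    decoded_words = []
    for _ in range(max_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states)
        sampled_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_word = reverse_target_index[sampled_index]
        if sampled_word == '_END_':
            break
        decoded_words.append(sampled_word)
        # Feed the prediction and the updated states back in
        target_seq = np.array([[sampled_index]])
        states = [h, c]
    return ' '.join(decoded_words)
```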

5. The Problem:

“Attention” is one of the recent trends in the deep learning community. Ilya Sutskever, the man behind the seq2seq architecture for machine translation described above, mentioned that ‘attention mechanisms’ are one of the most exciting advancements on top of the encoder-decoder approach, and that they are here to stay. But what is the problem with that approach, and what does attention solve?

To understand what attention can do for us, let's go back to the same machine translation problem. We wanted to translate the sentence ‘This is a good phone’ to ‘यह एक अच्छा फोन है’. Remember that we used an LSTM encoder to map the English sentence into final state vectors. Let's see that visually:

Fig-8

We can see that the vectors h5, c5 must encode everything we need to know about the source sentence. They must fully capture its meaning.

But there is a catch.

It captures the meaning of the whole sentence reasonably well, but only when the sentence is not long. For example, the sentence we took is only 5 words long, and a plain encoder does it justice; but when the sentence is 50 or 100 words long, this single final vector will not be able to capture the whole sequence. Look at it this way: if the sentence is 100 words long, the first word of the source sentence is probably highly correlated with the first word of the target sentence. But that means the decoder has to consider information from 100 steps ago, and that information needs to be somehow encoded in the vector. RNNs have the old problem of long-term dependencies. In theory, architectures like LSTM should be able to deal with this, but in practice long-term dependencies are still problematic. There are some hacks to make things better, but they are not principled solutions.

And this is where ‘Attention’ comes in.

6. What is Attention?

Attention is a small modification of the previous approach. We no longer try to encode the full source sentence into a single fixed-length vector. Rather, we make use of all the intermediate (local) state vectors collectively in order to decide on the next token while decoding the target sentence.

For example, in Fig-8 we will use all the h's and c's instead of only h5, c5. So if our decoder wants to decode ‘This’ into ‘यह’, it can directly access the first state vectors h1, c1. This idea of paying more attention to the relevant word is where the name Attention comes from. Intuitively, you can think of the decoder as attending to the first English word when producing the first Hindi word, and so on.

Implementation of Custom Keras Layer:

To implement our own attention mechanism, we need to write a custom Keras layer.

To implement a custom layer in Keras, we only need to write three simple methods (a bare-bones skeleton follows the list below):

  • build(input_shape): This is where you will define your weights. This method must set self.built = True at the end, which can be done by calling super([Layer], self).build().
  • call(x): This is where the layer's logic lives.
  • compute_output_shape(input_shape): In case your layer modifies the shape of its input, you should specify here the shape transformation logic.
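As a reference point, here is a skeleton of a custom layer with these three methods, essentially the example from the Keras documentation adapted to tf.keras imports; the attention-specific logic comes in the next section:

```python
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer

class MyLayer(Layer):
    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(MyLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Define a trainable weight for this layer
        self.kernel = self.add_weight(name='kernel',
                                      shape=(int(input_shape[-1]), self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(MyLayer, self).build(input_shape)   # sets self.built = True

    def call(self, x):
        # The layer's logic lives here
        return K.dot(x, self.kernel)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_dim)
```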

7. Implementing different types of Attention mechanisms:

The above explanation is a top-level view; we need to dig deeper to fully understand it. Since 2015, attention has become quite a popular concept and tool for solving NLP problems. In recent years, many advancements and modifications have been made on top of the simple attention mechanism, which has led to different types of models. While the underlying principles of attention are the same across all of them, the differences lie mainly in their architectures and computations. I will discuss some of the popular ones below, with implementations in Keras. We can use a GRU unit instead of an LSTM just for simplicity, as a GRU maintains a single hidden state whereas an LSTM maintains two.

7.1- Attention proposed by Bahdanau:

Right now we have all the encoded state vectors, (h1) to (h5). The attention layer takes these encoded vectors and the internal state of the decoder at the previous time step as input. For predicting the very first word, the decoder does not yet have a previous internal state, so we use the last state of the encoder (i.e. h5) as the previous decoder state. Using these vectors, an alignment score is calculated, which is then used to compute the attention weights. This is discussed in detail below.

The decoder is trained to predict the output y at time step t given the context vector c and previously predicted outputs.

Here y denotes the previously predicted outputs and c is the context vector. In other words, the decoder is trying to model the conditional probability stated above, which can be written as:
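p(yt | y1, …, yt−1, x) = g(yt−1, st, ct)    (following the notation of Bahdanau et al., 2015)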

where g is a nonlinear, potentially multi-layered, function that outputs the probability of ‘yt’, and ‘st’ is the decoder hidden state of the GRU/LSTM which is calculated as:
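st = f(st−1, yt−1, ct)    (the recurrent update of the GRU/LSTM cell)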

Here you can see that the probability is conditioned on a distinct context vector ci for every target word. The context vector ci depends on the sequence of encoder outputs (h1, h2, …, hTx) to which the encoder maps the input sentence.

The context vector ci is then computed as a weighted sum of these state vectors:
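ci = Σj (αij * hj)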

If we unroll the above formula for our example of five words, we will get:

context_vector = (α1 * h1 + α2 * h2 + α3 * h3 + α4 * h4 + α5 * h5)
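Each weight αij is obtained by applying a softmax to the energy scores eij:

αij = exp(eij) / Σk exp(eik)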

Here eij = a(si−1, hj) is an alignment model that scores how well the inputs around position j and the output at position i match. The score is based on the previous decoder GRU/LSTM hidden state si−1 and the jth encoder output hj of the input sentence. Here ‘a’ is a simple feedforward neural network which is jointly trained with all the other components of the proposed system. These alignment scores eij are also called energy scores.

To calculate this energy score/alignment score eij, the paper introduces three weight matrices: W_combined, W_decoder and W_encoder. The formula below will make it clear.
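eij = W_combinedᵀ · tanh(W_decoder · si−1 + W_encoder · hj)

(This is the additive score from Bahdanau et al., 2015, with their va, Wa and Ua renamed to W_combined, W_decoder and W_encoder to match the variables used here.)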

Let’s define these weights in the build function of the class AttentionLayer.
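A sketch of what that build method might look like, assuming the layer is called with a list [encoder_output_seq, decoder_state] so that input_shape arrives as a pair of shapes:

```python
def build(self, input_shape):
    # input_shape: [(batch, timesteps, enc_dim), (batch, dec_dim)]
    enc_dim = int(input_shape[0][-1])
    dec_dim = int(input_shape[1][-1])

    self.W_encoder = self.add_weight(name='W_encoder', shape=(enc_dim, enc_dim),
                                     initializer='glorot_uniform', trainable=True)
    self.W_decoder = self.add_weight(name='W_decoder', shape=(dec_dim, enc_dim),
                                     initializer='glorot_uniform', trainable=True)
    self.W_combined = self.add_weight(name='W_combined', shape=(enc_dim, 1),
                                      initializer='glorot_uniform', trainable=True)
    super(AttentionLayer, self).build(input_shape)
```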

We now have the weights required for the calculation of the alignment score. Next, let's define the call function of the class, where the actual calculation and logic of the code will reside.
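Here is a sketch of that call method; it computes the energy scores for every encoder time step and turns them into attention weights with a softmax (K is tensorflow.keras.backend):

```python
def call(self, inputs):
    # inputs: [encoder_output_seq (batch, T, enc_dim), decoder_state (batch, dec_dim)]
    encoder_output_seq, decoder_state = inputs

    # W_decoder . s_(i-1), broadcast over the T encoder time steps
    decoder_term = K.expand_dims(K.dot(decoder_state, self.W_decoder), axis=1)
    # W_encoder . h_j for every encoder time step j
    encoder_term = K.dot(encoder_output_seq, self.W_encoder)

    # e_ij = W_combined^T . tanh(W_decoder . s_(i-1) + W_encoder . h_j)
    energy = K.squeeze(K.dot(K.tanh(decoder_term + encoder_term), self.W_combined), axis=-1)

    # alpha_ij: softmax over the encoder time steps
    attention_weights = K.softmax(energy, axis=-1)
    return [energy, attention_weights]
```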

The attention layer will finally return the alignment scores and the attention weights. The weights will then be used to calculate the final context vector.

context_vector = attention_wts * enc_output_seq

Our final Attention Layer class is defined below:
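Keep in mind this is an illustrative sketch rather than a definitive implementation; the shapes assume the layer is called as AttentionLayer()([encoder_output_seq, decoder_state]), and the commented lines at the end show how the returned weights would be turned into the context vector:

```python
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer


class AttentionLayer(Layer):
    """Bahdanau-style (additive) attention, sketched as a custom Keras layer.

    Inputs : [encoder_output_seq (batch, T, enc_dim), decoder_state (batch, dec_dim)]
    Outputs: [energy_scores (batch, T), attention_weights (batch, T)]
    """

    def build(self, input_shape):
        enc_dim = int(input_shape[0][-1])
        dec_dim = int(input_shape[1][-1])
        self.W_encoder = self.add_weight(name='W_encoder', shape=(enc_dim, enc_dim),
                                         initializer='glorot_uniform', trainable=True)
        self.W_decoder = self.add_weight(name='W_decoder', shape=(dec_dim, enc_dim),
                                         initializer='glorot_uniform', trainable=True)
        self.W_combined = self.add_weight(name='W_combined', shape=(enc_dim, 1),
                                          initializer='glorot_uniform', trainable=True)
        super(AttentionLayer, self).build(input_shape)

    def call(self, inputs):
        encoder_output_seq, decoder_state = inputs
        decoder_term = K.expand_dims(K.dot(decoder_state, self.W_decoder), axis=1)
        encoder_term = K.dot(encoder_output_seq, self.W_encoder)
        # e_ij = W_combined^T . tanh(W_decoder . s_(i-1) + W_encoder . h_j)
        energy = K.squeeze(K.dot(K.tanh(decoder_term + encoder_term), self.W_combined), axis=-1)
        # alpha_ij = softmax over the encoder time steps
        attention_weights = K.softmax(energy, axis=-1)
        return [energy, attention_weights]

    def compute_output_shape(self, input_shape):
        batch, timesteps = input_shape[0][0], input_shape[0][1]
        return [(batch, timesteps), (batch, timesteps)]


# Usage (hypothetical tensors):
# energy, attn_weights = AttentionLayer()([enc_output_seq, dec_state])
# context_vector = K.sum(K.expand_dims(attn_weights, axis=-1) * enc_output_seq, axis=1)
```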

7.2- Attention proposed by Luong:

This mechanism was introduced by Thang Luong in 2015, after Bahdanau's mechanism. It was built on top of the previous mechanism, with two main differences:

  1. The way that the alignment score is calculated.
  2. The position at which the attention mechanism is introduced in the decoder.

In Bahdanau attention, the attention layer takes the encoded input vectors and the decoder hidden state at the previous time step as input, calculates the context vector from them, and then uses that context vector to predict the output. In Luong attention, the context vector is utilized only after the RNN/LSTM has produced its output for that time step: the previous decoder hidden state and the previous decoder output are first passed through the decoder RNN to generate a new hidden state for the current time step, and the alignment scores are then calculated using this new decoder hidden state and the encoder hidden states.

There are three ways in which the alignment score can be calculated, compared to Bahdanau's single formulation (these are the score functions from Luong et al., 2015). They are:

Dot:

In this function, we only need the hidden state of the decoder and the hidden state of the encoder to calculate the alignment score:
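score = H_decoderᵀ · H_encoder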

General:

This is similar to the dot function, except that a weight matrix Wa is introduced between the two hidden states:
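score = H_decoderᵀ · Wa · H_encoder    (Wa is a trainable weight matrix)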

Concatenate:

This function calculates the score in the following way:
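score = Vaᵀ · tanh(Wa · [H_decoder ; H_encoder])    (Va and Wa are trainable weights, and [ · ; · ] denotes concatenation)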

Here H_decoder is the new hidden state, generated by passing the previous hidden state and the previous decoder output through the decoder RNN.

Below is the implementation of Luong Attention:
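Below is a sketch of a Luong-style attention layer using the ‘general’ score; as before, the layer name, shapes and weight names are my own assumptions rather than the exact code from the paper:

```python
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer


class LuongAttention(Layer):
    """Luong-style 'general' (multiplicative) attention, sketched as a custom layer.

    Inputs : [encoder_output_seq (batch, T, enc_dim), decoder_state (batch, dec_dim)]
    Outputs: [context_vector (batch, enc_dim), attention_weights (batch, T)]
    """

    def build(self, input_shape):
        enc_dim = int(input_shape[0][-1])
        dec_dim = int(input_shape[1][-1])
        # The single weight matrix Wa used by the 'general' score
        self.W_a = self.add_weight(name='W_a', shape=(dec_dim, enc_dim),
                                   initializer='glorot_uniform', trainable=True)
        super(LuongAttention, self).build(input_shape)

    def call(self, inputs):
        encoder_output_seq, decoder_state = inputs

        # score_j = H_decoder^T . Wa . h_j for every encoder time step j
        projected = K.dot(decoder_state, self.W_a)            # (batch, enc_dim)
        scores = K.batch_dot(encoder_output_seq, projected)   # (batch, T)

        # Softmax over the encoder time steps
        attention_weights = K.softmax(scores, axis=-1)

        # context = sum_j alpha_j * h_j
        context_vector = K.sum(
            K.expand_dims(attention_weights, axis=-1) * encoder_output_seq, axis=1)
        return [context_vector, attention_weights]

    def compute_output_shape(self, input_shape):
        batch, timesteps, enc_dim = input_shape[0]
        return [(batch, enc_dim), (batch, timesteps)]
```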

8. The drawback of Attention:

As stated above, attention computes a context vector for every output word, which comes at a computational cost: the longer the sequence, the longer it takes to train. Also, human attention is something that's supposed to save computational resources; by focusing on one thing, we can neglect many other things. But that's not really what we're doing in the above model. We're essentially looking at everything in detail before deciding what to focus on. Intuitively, that's equivalent to outputting a translated word and then going back through all of your internal memory of the text in order to decide which word to produce next. That seems like a waste, and not at all what humans are doing. In fact, it's more akin to memory access than attention, which in my opinion makes the name somewhat of a misnomer. Still, that hasn't stopped attention mechanisms from becoming quite popular and performing well on many tasks.

9. References:

1- https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/

2- https://github.com/tensorflow/nmt

3- https://www.tensorflow.org/tutorials/text/nmt_with_attention

4- https://machinelearningmastery.com/encoder-decoder-recurrent-neural-network-models-neural-machine-translation/
