NLP Series: Encoder-Decoder Model and Attention Model

Shiv Shankar Dutta
Published in Analytics Vidhya · Jul 9, 2020 · 8 min read

In this post, I am going to explain the Attention Model. To understand the attention model, prior knowledge of RNN and LSTM is needed. For RNN and LSTM, you may refer to Krish Naik's YouTube videos, Christopher Olah's blog, and Sudhanshu's lectures.

Problem with CNN in Text Analysis:

The CNN model works well for vision-related use cases but fails on text because it cannot remember the context provided in a particular text sequence. It cannot capture the sequential structure of the data, where every word depends on the previous words or sentences. RNN, LSTM, the Encoder-Decoder model, and the Attention model help solve this problem. However, RNN, LSTM, and the Encoder-Decoder model still struggle to remember the context of long sentences, which results in poor accuracy.

To understand the Attention Model, it is necessary to first understand the Encoder-Decoder model, which is its initial building block.

Encoder-Decoder Model

Images are taken from Sudhanshu Lecture on Encoder-Decoder Model

The Encoder-Decoder model consists of an input (encoder) layer and an output (decoder) layer unrolled over time steps.

Encoder: The input is provided to the encoder layer, and there is no immediate output from each cell; only when the end of the sentence/paragraph is reached is the output given out. Each cell has two inputs: the output from the previous cell and the current input. The cell in the encoder can be an RNN, LSTM, GRU, or Bidirectional LSTM network, which acts as a many-to-one sequential neural model. Currently, we take the unidirectional type, which can be an RNN/LSTM/GRU. The encoder network, which is basically a neural network, learns its weights from the provided input through backpropagation. Once the weights are learned, the combined embedding vector/combined weights of the hidden layer are given as the output of the encoder.
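To make this concrete, here is a minimal sketch of such an encoder in Keras. The library choice, layer sizes, vocabulary size, and variable names are my own illustrative assumptions, not taken from the lecture:

```python
import tensorflow as tf

# Illustrative sizes -- assumptions for this sketch, not values from the lecture.
vocab_size = 10000   # source vocabulary size
embed_dim = 128      # word-embedding dimension
hidden_units = 256   # LSTM hidden-state size

# Encoder: consumes the whole input sentence and keeps only its final states.
encoder_inputs = tf.keras.Input(shape=(None,), name="source_tokens")
encoder_embed = tf.keras.layers.Embedding(vocab_size, embed_dim)(encoder_inputs)
# return_state=True returns the final hidden state and cell state, which together
# play the role of the "combined embedding vector" handed to the decoder.
_, state_h, state_c = tf.keras.layers.LSTM(hidden_units, return_state=True)(encoder_embed)
encoder_states = [state_h, state_c]
```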

Decoder: The output from the encoder is given as input to the decoder (represented as E in the diagram), and the initial input to the first cell in the decoder is the hidden-state output from the encoder (represented as S0 in the diagram). Subsequently, the output from each cell in the decoder network is given as input to the next cell, along with the hidden state of the previous cell. Each cell in the decoder produces output until it encounters the end of the sentence. The cell in the decoder can be an LSTM, GRU, or Bidirectional LSTM network. Currently, we take the unidirectional type, which can be an RNN/LSTM/GRU.
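Continuing the same illustrative sketch, a minimal decoder could look as follows: another LSTM initialised with the encoder's final states, followed by a softmax over the target vocabulary. Again, the sizes and names are assumptions:

```python
# Decoder: initialised with the encoder's final states; during training the
# previous target word is fed in at each step (teacher forcing).
target_vocab_size = 10000  # assumption: target vocabulary size

decoder_inputs = tf.keras.Input(shape=(None,), name="target_tokens")
decoder_embed = tf.keras.layers.Embedding(target_vocab_size, embed_dim)(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(hidden_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embed, initial_state=encoder_states)
# Softmax over the target vocabulary at every time step.
predictions = tf.keras.layers.Dense(target_vocab_size, activation="softmax")(decoder_outputs)

model = tf.keras.Model([encoder_inputs, decoder_inputs], predictions)
```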

Advantages:

  1. Unlike a plain LSTM, the Encoder-Decoder model is able to consume a whole sentence or paragraph as input.

Disadvantages:

  1. Problem with large/complex sentences: The effectiveness of the combined embedding vector received from the encoder fades as we propagate forward through the decoder network. As we saw, the output from each decoder cell is passed to the subsequent cell. For long sentences, the effectiveness of the embedding vector is lost, producing less accurate output, although it is still better than a bidirectional LSTM alone.

Solution: The solution to the problem faced by the Encoder-Decoder model is the Attention model.

Attention Model:

Images were taken from Sudhanshu Lecture on Attention Based Model

The Attention Model is a core building block of deep learning NLP; more advanced models are built on the same concept.

For long sentences, the previous models are not able to predict well. That is why, rather than considering the whole long sentence, we consider parts of the sentence, known as attention, so that the context of the sentence is not lost.

How do we achieve this? As mentioned earlier, in the Encoder-Decoder model the entire output, the combined embedding vector/combined weights of the hidden layer, is taken as input to the decoder. In an attention-based mechanism, we instead consider the sentence/paragraph in parts, focusing on a few words at a time, so that accuracy can be improved.

The cell in the encoder can be an LSTM, GRU, or Bidirectional LSTM network. Currently, we take the bidirectional type. The Bidirectional LSTM learns the weights in both directions, forward as well as backward, which gives better accuracy.

Referring to the diagram above, the Attention-based model consists of 3 blocks:

  1. Encoder
  2. Decoder
  3. Attention

Encoder: All the cells in the encoder are Bidirectional LSTMs. There is a sequence of LSTM cells connected in the forward direction and a sequence of LSTM cells connected in the backward direction. The cells in both directions are fed with the inputs X1, X2, ..., Xn. The outputs of the forward and backward cells at each position are combined to produce the outputs h1, h2, ..., hn. The number of RNN/LSTM cells in the network is configurable: if the size of the network is 1000 and only 100 words are supplied, the end of the line is encountered after 100 cells and the remaining 900 cells are not used.
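A rough sketch of such a bidirectional encoder, continuing the earlier Keras example with the same illustrative sizes, might look like this. Note return_sequences=True, since the attention unit needs one annotation per word rather than only the final state:

```python
# Bidirectional encoder: each annotation h_j is the concatenation of the forward
# and backward LSTM outputs at position j (sizes are illustrative assumptions).
encoder_inputs = tf.keras.Input(shape=(None,), name="source_tokens")
encoder_embed = tf.keras.layers.Embedding(vocab_size, embed_dim)(encoder_inputs)
# return_sequences=True keeps one output per word, h1, h2, ..., hTx, instead of
# only the final state -- the attention unit needs all of them.
annotations = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(hidden_units, return_sequences=True)
)(encoder_embed)  # shape: (batch, Tx, 2 * hidden_units)
```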

Attention Unit: The outputs from the encoder h1, h2, ..., hn are passed to the first input of the decoder through the attention unit. Sometimes the sentence is of length five, and sometimes it is ten. Let us consider that the first cell of the decoder takes three hidden outputs from the encoder.

These multiple hidden-layer outputs are passed through a feed-forward neural network to create the context vector Ci, and this context vector is fed to the decoder as input, rather than the entire embedding vector.

a11, a21, a31 are the weights of the feed-forward network connecting the outputs of the encoder to the input of the decoder.

In the above diagram, h1, h2, ..., hn are the inputs to this neural network, and a11, a21, a31 are the weights of the hidden units, which are trainable parameters. The context vector Ci is the output of the attention unit.

The weight a11 refers to the first hidden unit of the encoder and the first input of the decoder. Similarly, a21 refers to the second hidden unit of the encoder and the first input of the decoder. Using this feed-forward neural network with its inputs and weights, we can find which hidden output contributes more to the creation of the context vector.

The window size (referred to as T) depends on the type of sentence/paragraph. It is a hyperparameter and changes with different types of sentences/paragraphs. A window size of 50 is reported to give a better BLEU score.

eij is the output score of a feed-forward neural network, described by the alignment function a, that attempts to capture the alignment between the input at position j and the output at position i. This is the main attention function: eij = a(si-1, hj). Here si-1 is the hidden state of the decoder at the previous step (zero for the first decoder step), hj is the j-th encoder output, and a is a small feed-forward network whose weights W are learned jointly with the rest of the model.
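One common form of this alignment function is the additive, Bahdanau-style form; the exact network used in the lecture may differ, so treat this NumPy sketch as an illustration only:

```python
import numpy as np

def alignment_score(s_prev, h_j, W_s, W_h, v):
    """Additive (Bahdanau-style) score eij = v . tanh(W_s s_{i-1} + W_h h_j).

    s_prev : previous decoder hidden state s_{i-1} (zeros at the first step)
    h_j    : encoder annotation for source position j
    W_s, W_h, v : trainable parameters of the small feed-forward network
    """
    return float(v @ np.tanh(W_s @ s_prev + W_h @ h_j))

# Toy usage (shapes only; the numbers are meaningless):
d_s, d_h = 4, 6
s0 = np.zeros(d_s)              # s_{i-1} is zero at the first decoder step
h1 = np.random.randn(d_h)
W_s = np.random.randn(d_s, d_s)
W_h = np.random.randn(d_s, d_h)
v = np.random.randn(d_s)
e_11 = alignment_score(s0, h1, W_s, W_h, v)  # score e11 for h1 and the first output
```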

aij: There are two conditions defined for aij:

  1. aij should always be greater than zero, i.e. aij should always have a positive value. This is because in backpropagation we should be able to learn the weights through multiplication; negative weights would cause the vanishing gradient problem.
  2. The sum of all the weights should be one, for better regularization. This is nothing but the Softmax function: aij = exp(eij) / Σk exp(eik), where the sum runs over the window size T, which is 3 here.

With these weights, the input that goes into the first context vector is C1 = h1 * a11 + h2 * a21 + h3 * a31. Similarly, the second context vector is C2 = h1 * a12 + h2 * a22 + h3 * a32.
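Here is a small NumPy sketch of these two steps, the softmax that turns the scores into positive weights summing to one, and the weighted sum that produces the context vector. The numbers are made up purely for illustration:

```python
import numpy as np

def attention_weights(scores):
    """Softmax over the scores: every aij > 0 and the weights sum to one."""
    exp_scores = np.exp(scores - np.max(scores))  # subtract the max for stability
    return exp_scores / exp_scores.sum()

def context_vector(annotations, weights):
    """Weighted sum of the encoder annotations: ci = sum_j aij * hj."""
    return np.sum(weights[:, None] * annotations, axis=0)

# Toy example with a window of T = 3 annotations of size 4 (made-up numbers).
h = np.random.randn(3, 4)         # h1, h2, h3
e = np.array([0.5, 1.2, -0.3])    # scores e11, e21, e31 from the alignment network
a = attention_weights(e)          # a11, a21, a31: positive, summing to one
c1 = context_vector(h, a)         # c1 = a11*h1 + a21*h2 + a31*h3
```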

In the attention unit, we introduce a feed-forward network that is not present in the plain encoder-decoder model. This network learns how to combine the hidden outputs into the context vector, so the decoder no longer depends on a single Bi-LSTM output alone.

All the vectors h1, h2, etc. used here are basically the concatenation of the forward and backward hidden states in the encoder. To put it in simple terms, the vectors h1, h2, h3, ..., hTx are representations of the Tx words in the input sentence.

The weights are also learned by a feed-forward neural network, and the context vector ci for the output word yi is generated as the weighted sum of the annotations: ci = Σj aij * hj, with j running from 1 to Tx.

Decoder: Each decoder cell produces an output y1, y2, ..., yn, and each output is passed through a softmax function. The output of the first cell is passed as input to the next cell, together with a relevant/separate context vector created by the attention unit for that step. Note: every cell has a separate context vector and its own feed-forward computation in the attention unit.
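The following toy NumPy loop sketches this decoding process. It uses a simplified vanilla-RNN update instead of an LSTM and omits feeding the previous output word back in, to keep the sketch short; all sizes and matrices are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
T, annot_dim, state_dim, vocab = 3, 4, 5, 6          # toy sizes (assumptions)
H = rng.standard_normal((T, annot_dim))              # encoder annotations h1..hT
W_as = rng.standard_normal((state_dim, state_dim))   # attention parameters
W_ah = rng.standard_normal((state_dim, annot_dim))
v = rng.standard_normal(state_dim)
W_s = rng.standard_normal((state_dim, state_dim))    # decoder state update
W_c = rng.standard_normal((state_dim, annot_dim))
W_o = rng.standard_normal((vocab, state_dim))        # output projection

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

s = np.zeros(state_dim)                              # s0: initial decoder state
for i in range(4):                                   # produce a few output steps
    # 1. a fresh context vector for THIS step, built from all annotations
    scores = np.array([v @ np.tanh(W_as @ s + W_ah @ h) for h in H])
    a = softmax(scores)
    c = a @ H                                        # ci = sum_j aij * hj
    # 2. simplified (vanilla-RNN) state update from the previous state and ci
    s = np.tanh(W_s @ s + W_c @ c)
    # 3. softmax over the output vocabulary gives the distribution for word yi
    y = softmax(W_o @ s)
```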

In the image above, the model learns which word to focus on. The word "it" has two possible dependencies, "animal" and "street". After learning, "street" was given the higher weight.

Examples of Attention Models:

Language Translation:

Conclusion: Just as a neural network increases and decreases the weights of features during training, the Attention model learns to give more weight to the important words during training.

I would like to thank Sudhanshu for unfolding the complex topic of the attention mechanism; I have referred to his lectures extensively while writing this post.
