What is Attention in NLP?

Jaimin Mungalpara · Nerd For Tech · Feb 26, 2021

In this blog we will look at a pivotal piece of research in NLP that has changed how the entire field approaches the problem: Attention.

In my previous articles we have seen RNN, LSTM, Bi-Directional RNN, and Encoder-Decoder architectures. All of these architectures have their own limitations. Although the encoder-decoder architecture addressed the problem of capturing semantic information and the correlation between words, it still has some limitations. Firstly, when we deal with long sentences we can lose information because of the architecture. Secondly, only the context vector, which is the output of the last encoder cell, is given as input to the decoder. This means we can lose semantic information and the dependencies between words during sentence generation.

To resolve these issues with the previous approaches, attention was first introduced in 2014 by Bahdanau et al. in the paper "Neural Machine Translation by Jointly Learning to Align and Translate" (published at ICLR 2015). This research paper shows how we can capture the semantic information of a word and the correlation between words when generating the output, using attention.

Let’s compare the architecture of the traditional encoder-decoder with the one that uses attention.

Encoder Decoder Architecture

Image source: http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/

In this architecture, the output of the last encoder cell, in the form of a context vector, is given to the decoder.

Encoder Decoder Architecture With Attention

In this architecture an attention layer is added: the output Oi of each RNN cell in the encoder is given to the attention layer, and the output of the attention layer is given to the decoder RNN cells.

The very first attention mechanism was introduced by Dzmitry Bahdanau, and it is an additive attention. The aim was to improve the seq2seq model with the addition of attention. Let’s take a deep dive into a step-by-step implementation of this model.

1) Encoder

The encoder structure is the same as in the seq2seq architecture. Here, according to the research paper mentioned above, a bi-directional RNN is used in the architecture.

The input is given to each RNN cell, and the state Oi of each and every RNN cell is sent to the next layer, as shown in the figure below.

Code Implementation of the Encoder in TensorFlow

We have used tf.keras to build the encoder model, with an __init__ method and a call method defining the encoder architecture. In the Encoder class we have also defined a method to initialize the hidden state. In the __init__ method we define the layers we require in this architecture, for example the embedding and GRU layers, along with the parameters required to build them. The parameters are listed below.

  1. batch_size: the number of samples passed to the model in each training step (batch). This parameter can strongly affect model performance.
  2. vocab_size: the total number of unique words in the training dataset.
  3. enc_units: the number of LSTM/GRU units in the encoder.
  4. embedding_dim: the dimension of the embedding layer, whose output is passed to the LSTM/GRU layer in the next stage. A larger dimension also increases the computational cost of the model.

The call method simply defines the forward propagation path: the input is taken and passed to the embedding layer, and then the embedded data and the initial hidden state are passed to the LSTM/GRU layer.
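The original post shows the encoder code as an image. Below is a minimal sketch of such an encoder in TensorFlow, following the structure described above and the TensorFlow NMT tutorial listed in the references (the class and parameter names match the ones used in this article; note that this sketch uses a single GRU rather than the bi-directional RNN of the paper).

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
        super(Encoder, self).__init__()
        self.batch_size = batch_size
        self.enc_units = enc_units
        # Embedding layer: maps token ids to dense vectors of size embedding_dim
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # GRU layer: returns the full sequence of hidden states (needed by the
        # attention layer) as well as the final state (passed to the decoder)
        self.gru = tf.keras.layers.GRU(enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        # x: (batch_size, seq_len) token ids
        x = self.embedding(x)                              # (batch_size, seq_len, embedding_dim)
        output, state = self.gru(x, initial_state=hidden)  # output: (batch_size, seq_len, enc_units)
        return output, state

    def initialize_hidden_state(self):
        # Initial hidden state of zeros, as described above
        return tf.zeros((self.batch_size, self.enc_units))
```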

2) The Attention Mechanism

The goal of attention is to get contextual information for the decoder.

Feed Forward Neural network for attention mechanism

This mechanism identifies a weight for each input X1, X2, X3, X4. For example, if context vector C2 depends mainly on inputs X2 and X4, it will lower the weights of X1 and X3. In this way attention is computed over the input data and passed to the decoder along with the decoder hidden state.

We calculate the context vector from the weights α and the output Oi of each RNN cell of the encoder. The context vector C can be calculated with the formula below.

Context vector C1 calculation
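For example, if the encoder produces four outputs O1, O2, O3, O4, the first context vector is a weighted sum of those outputs (a concrete instance of the general formula that follows, written in the Oi/α notation of the figure):

$$C_1 = \alpha_{11} O_1 + \alpha_{12} O_2 + \alpha_{13} O_3 + \alpha_{14} O_4$$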

So, according to the research paper, the context vector Ci for the output yi is calculated with the formula below, where Ci is the context vector and αij is the alignment weight:
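$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

where $T_x$ is the length of the input sentence and $h_j$ is the annotation (encoder hidden state) of the $j$-th input word.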

The context vector is simply the sum of the hidden states weighted by the alignment scores. Here we are discussing additive attention, which is called Bahdanau’s attention; the other variant is multiplicative attention, which is called Luong’s attention. As we are talking about attention, one thing should be noted: αij should be ≥ 0, since a negative relation expressed by a negative weight is not desired, and the αij should sum to 1. The weight αij of each annotation hj is computed by:
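$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$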

Source: https://arxiv.org/pdf/1409.0473.pdf

This equation is essentially a softmax function: the output range is 0 to 1, and the sum of all the weights is equal to one, which satisfies the rules mentioned above. The next term, eij, plays a pivotal role and is called the attention (alignment) function. The attention function is denoted as:
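$$e_{ij} = a(s_{i-1}, h_j)$$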

Source: https://arxiv.org/pdf/1409.0473.pdf

eij scores how well the input at position j and the output at position i match. The encoder output hj and the hidden state si-1 of the previous decoder cell are used to calculate this attention score. According to the research paper, eij is actually calculated with the formula below:
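$$e_{ij} = v_a^{\top} \tanh(W_a s_{i-1} + U_a h_j)$$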

Source: https://arxiv.org/pdf/1409.0473.pdf

where Wa, Ua and va are the learned weight matrices (va ∈ Rⁿ is a vector). The entire process is shown in the figure below.

How the attention mechanism works, step by step

Code Implementation of Bahdanau’s Attention in TensorFlow

In the BahdanauAttention class, just as in the Encoder class, we create the __init__ and call methods. In the __init__ method, three Dense layers are created: W1, W2 and V. As shown in the figure above, these matrices play the roles of Wa, Ua and va in the formula. We pass the decoder’s current state to W1 and the encoder output to W2.

In the call method, W1 and W2 are given the query and the values, which are the decoder hidden state and the encoder output respectively. The score is then calculated with a tanh function, and this score is passed through a softmax function to generate the attention weights. Finally, the attention weights and the values are used to generate the context vector Ci.
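Again, the original code appears as an image; a minimal sketch of such a layer in TensorFlow, assuming the class and layer names described above (and following the TensorFlow tutorial in the references), could look like this:

```python
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)  # applied to the decoder state (role of Wa)
        self.W2 = tf.keras.layers.Dense(units)  # applied to the encoder output (role of Ua)
        self.V = tf.keras.layers.Dense(1)       # produces the scalar score eij (role of va)

    def call(self, query, values):
        # query:  decoder hidden state, shape (batch_size, hidden_size)
        # values: encoder outputs,      shape (batch_size, seq_len, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)  # (batch_size, 1, hidden_size)

        # Additive score: eij = va^T tanh(Wa * s_{i-1} + Ua * hj)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))

        # Softmax over the input positions gives the alignment weights αij
        attention_weights = tf.nn.softmax(score, axis=1)  # (batch_size, seq_len, 1)

        # Context vector Ci = sum over j of αij * hj
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)  # (batch_size, hidden_size)
        return context_vector, attention_weights
```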

3) The Decoder

The decoder model is the same as described in the Encoder-Decoder article; the only difference is that an attention layer is added, so the input to the decoder changes accordingly.

Here, the context vector Ci is calculated from the encoder output and the decoder hidden state St-1. Then the context vector, together with the previous output Y, is given as input to the RNN cell. This is the difference in the decoder architecture.

Code Implementation of the Decoder in TensorFlow

Here we create the same __init__ and call methods. In the __init__ method we define the embedding, GRU, and dense layers as per the general decoder architecture. Now, in the call method we first call the attention layer, which returns the context vector and the attention_weights. Then the context vector is concatenated with the embedded decoder input, and this is given to the GRU and a fully connected layer for the final output. The output is returned as x, state and attention_weights: x is used to compute the loss (whose gradients adjust the weights in back-propagation), state is passed to the next decoding step, and attention_weights can be used to visualize the alignment.
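A minimal sketch of such a decoder in TensorFlow, assuming the Encoder and BahdanauAttention classes sketched above (the name dec_units is an assumption for the number of GRU units in the decoder), could look like this:

```python
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
        super(Decoder, self).__init__()
        self.batch_size = batch_size
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)    # fully connected layer -> vocabulary logits
        self.attention = BahdanauAttention(dec_units)  # attention layer defined above

    def call(self, x, hidden, enc_output):
        # Context vector Ci from the previous decoder state and all encoder outputs
        context_vector, attention_weights = self.attention(hidden, enc_output)

        x = self.embedding(x)  # (batch_size, 1, embedding_dim)
        # Concatenate the context vector with the embedded decoder input
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        x = self.fc(output)    # (batch_size, vocab_size)
        return x, state, attention_weights
```

During training, the encoder is run once over the whole input sentence, and the decoder is then called one step at a time: the logits x feed the loss, the returned state becomes the hidden state for the next step, and attention_weights can be stored to visualize the alignment.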

In this blog we have discussed the encoder-decoder architecture with an attention mechanism. In the next article I will cover the Transformer architecture, which builds on the architecture described above.

Suggestions are always welcome.

References

  1. https://www.tensorflow.org/tutorials/text/nmt_with_attention
  2. https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
  3. https://towardsdatascience.com/sequence-to-sequence-models-attention-network-using-tensorflow-2-d900cc127bbe
  4. https://arxiv.org/pdf/1409.0473.pdf
