Attention Mechanism: A Quick Intuition

Nihal Das
Dec 21, 2019 · 5 min read

Contents:

  1. Introduction
  2. Sequence-to-Sequence Models
  3. Problem with Seq2Seq Models
  4. Need for Attention
  5. Custom Keras Attention Layer — Code Example
  6. Update with TensorFlow 2.0
  7. Conclusion

1. Introduction

In this article, we will try to understand the basic intuition behind the attention mechanism and why it came into the picture. We aim to understand how encoder-decoder models work and how attention helps them achieve better results.

We will also see how to build a custom attention layer with Keras and how to use the default attention layer provided with TensorFlow 2.0.

2. Sequence-to-Sequence Models

A sequence-to-sequence model maps an input sequence to an output sequence, where the lengths of the input and output may differ.
Example: “You are reading this article” in English is “Vous lisez cet article” in French, where
Input length = 5 words
Output length = 4 words
A plain LSTM/GRU that produces one output per input step cannot map each English word to a French word here, hence we use a sequence-to-sequence model to address problems like this one.

The encoder-decoder model for recurrent neural networks (RNNs) is a powerful type of sequence-to-sequence model. These models are widely used in natural language processing, with use cases such as machine translation, image captioning and text summarisation.

Google Translate

A sequence-to-sequence model has two components, an encoder and a decoder. The encoder encodes the source sentence into a concise vector (called the context vector), and the decoder takes the context vector as input and computes the translation from this encoded representation.
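
To make the encoder-decoder idea concrete, here is a minimal sketch of a toy seq2seq translation model in Keras. The vocabulary sizes, embedding size and latent dimension are placeholder values chosen for illustration, not taken from the article.

from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model

# Placeholder sizes, for illustration only
src_vocab, tgt_vocab, latent_dim = 5000, 5000, 256

# Encoder: reads the source sentence and summarises it into its
# final hidden/cell states (the "context vector").
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, 128)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]

# Decoder: starts from the context vector and generates the target
# sentence one token at a time (teacher forcing during training).
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, 128)(decoder_inputs)
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(dec_emb, initial_state=encoder_states)
decoder_outputs = Dense(tgt_vocab, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

At inference time the decoder would be run step by step, feeding each predicted token back in as the next input.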

3. Problem with Seq2Seq Models

A problem with these models is that performance decays as the length of the input sentence increases. The reason is:

  • The word to be predicted depends on the context of the input sentence and not on a single word. To predict one word in French, we might make use of 2–3 words of the English sentence; that is also how humans translate from one language to another.

Now, as the input sentence grows longer, the LSTM/GRU gradually loses the context of the earlier words, and with it the meaning of the whole sentence, eventually resulting in poor performance.

4. Need for Attention

To tackle the above-mentioned limitations, the attention mechanism was introduced.

The whole idea of attention is that, instead of relying only on the final context vector, the decoder has access to all the past states of the encoder. At each decoding step, the decoder gets to look at any particular state of the encoder.

The attention mechanism tries to identify which parts of the input sequence are relevant to each word in the output and uses that information to select the appropriate output.

Working

Here, we make use of bi-directional GRU cells, where the input sequence is read in both the forward and the backward direction. The outputs of the two directions are concatenated and passed on to the decoder; a small shape check after the figure below illustrates this.

Bi-directional GRU cells
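
As a quick illustration of the concatenation (the sequence length, embedding size and unit count below are purely illustrative), a Bidirectional GRU with 64 units per direction produces a 128-dimensional output at every time step:

from keras.layers import Input, Bidirectional, GRU
from keras.models import Model

inp = Input(shape=(20, 300))                               # 20 time steps, 300-dim embeddings
out = Bidirectional(GRU(64, return_sequences=True))(inp)   # forward + backward outputs concatenated
print(Model(inp, out).output_shape)                        # (None, 20, 128)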

To tackle the limitation, we use a weighted sum of a selected number of past encoder states. Two quantities are involved:
1. How many past states are necessary
2. The weight given to each selected past state
Since these quantities can be learned with back-propagation, we can treat this as a layer that fits between the encoder and the decoder, as shown in the figure below; a small numerical example follows the figure.

Attention Mechanism in Encoder-Decoder Model
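
To see how the weighted sum works numerically, here is a tiny NumPy sketch of the softmax-and-sum computation performed at one decoding step. The encoder states and alignment scores are made-up numbers, used only for illustration:

import numpy as np

# Three encoder states (illustrative 4-dimensional vectors)
states = np.array([[1.0, 0.0, 0.5, 0.2],
                   [0.3, 0.8, 0.1, 0.4],
                   [0.6, 0.2, 0.9, 0.7]])

# Made-up alignment scores for one decoding step
scores = np.array([2.0, 0.5, 1.0])

# Softmax turns the scores into attention weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum()   # approx. [0.63, 0.14, 0.23]

# Context vector = weighted sum of the encoder states
context = (weights[:, None] * states).sum(axis=0)
print(weights, context)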

5. Custom Keras Attention Layer — Code Example

This is a code snippet that was used to create an attention layer for one of the problems.

from keras.layers import Layer
import keras.backend as K

class Attention(Layer):
    def __init__(self, **kwargs):
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        """
        Weight matrices for creating the context vector.
        """
        self.W = self.add_weight(name="att_weight", shape=(input_shape[-1], 1),
                                 initializer="normal")
        self.b = self.add_weight(name="att_bias", shape=(input_shape[1], 1),
                                 initializer="zeros")
        super(Attention, self).build(input_shape)

    def call(self, x):
        """
        Computes the alignment scores, passes them through a softmax to get
        the attention probabilities, and returns the context vector
        (the weighted sum of the input states).
        """
        et = K.squeeze(K.tanh(K.dot(x, self.W) + self.b), axis=-1)
        at = K.softmax(et)
        at = K.expand_dims(at, axis=-1)
        output = x * at
        return K.sum(output, axis=1)

    def compute_output_shape(self, input_shape):
        """
        For Keras internal compatibility checking.
        """
        return (input_shape[0], input_shape[-1])

    def get_config(self):
        """
        Returns the layer configuration for serialization.
        """
        return super(Attention, self).get_config()
Flow of calculating Attention weights in Attention Layer
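
Before wiring the layer into a full model, it can be sanity-checked on a dummy input; the shapes below are illustrative, and the output shape should match what compute_output_shape() declares (the time dimension is collapsed):

from keras.layers import Input
from keras.models import Model

seq_in = Input(shape=(20, 128))           # 20 time steps, 128 features (illustrative)
ctx = Attention()(seq_in)                 # custom layer defined above
print(Model(seq_in, ctx).output_shape)    # (None, 128)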

Model building

from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, GRU,
                          Conv1D, GlobalAveragePooling1D, GlobalMaxPooling1D,
                          Dense, concatenate)

# Input: a padded sequence of word indices
input_text_bgru = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='float32')

# Frozen pre-trained word embeddings
embedding_layer_bgru = Embedding(len(tokenizer.word_index) + 1,
                                 300,
                                 weights=[embedding_matrix],
                                 input_length=MAX_SEQUENCE_LENGTH,
                                 trainable=False)

g = embedding_layer_bgru(input_text_bgru)
g = SpatialDropout1D(0.4)(g)
g = Bidirectional(GRU(64, return_sequences=True))(g)

# Custom attention layer defined above, applied to the GRU outputs
att = Attention()(g)

g = Conv1D(64, kernel_size=3, padding="valid", kernel_initializer="he_uniform")(g)
avg_pool1 = GlobalAveragePooling1D()(g)
max_pool1 = GlobalMaxPooling1D()(g)

# Concatenate the attention context vector with the pooled convolution features
g = concatenate([att, avg_pool1, max_pool1])
g = Dense(128, activation='relu')(g)
bgru_output = Dense(2, activation='softmax')(g)
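
Assuming the snippet above sits inside a larger training script (tokenizer, embedding_matrix and MAX_SEQUENCE_LENGTH are defined there), the graph can then be wrapped into a Model and compiled as usual. This is a sketch of that step, not code from the original article:

from keras.models import Model

# Wire the layers defined above into a trainable model
model_bgru = Model(inputs=input_text_bgru, outputs=bgru_output)
model_bgru.compile(loss='categorical_crossentropy',
                   optimizer='adam',
                   metrics=['accuracy'])
model_bgru.summary()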

6. Update with TensorFlow 2.0

With TensorFlow 2.0, an Attention layer has been added to tf.keras.layers and can now be used directly without defining it explicitly.

query_value_attention_seq = tf.keras.layers.Attention()(
    [query_seq_encoding, value_seq_encoding])
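
A minimal self-contained example of the built-in layer might look like the following; the sequence lengths and feature dimensions are purely illustrative, and nothing here is prescribed by the article:

import tensorflow as tf

# Illustrative shapes
query_input = tf.keras.Input(shape=(10, 64))   # e.g. decoder states
value_input = tf.keras.Input(shape=(20, 64))   # e.g. encoder states

# Dot-product attention between the query and value sequences
attended = tf.keras.layers.Attention()([query_input, value_input])  # (None, 10, 64)

model = tf.keras.Model([query_input, value_input], attended)
model.summary()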

This makes attention easier to use and less cumbersome for machine learning developers when designing complex architectures.

7. Conclusion

The attention mechanism is not limited to machine translation. It is also used in image captioning, where visual attention is applied to the feature maps produced by a CNN.

The attention mechanism has uses beyond what we covered in this article. Hopefully you now have a general overview of the problem attention is trying to solve. In this article we implemented a basic attention mechanism in seq2seq models with RNNs; Transformer models such as Google’s BERT and XLNet, which rely on self-attention, are major advancements and are currently state-of-the-art in the field of NLP.


*** Thank you all for reading this article. Your suggestions are very much appreciated! ***
