Attention Mechanism in Deep Neural Networks, Simplified

dhruv shindhe
Published in Analytics Vidhya · Apr 15, 2020
DNN using attention to convert text to image

Attention is the process of looking at certain parts of an image, or certain words, at high resolution while processing the rest at low resolution. Let’s say I would like to predict the next word in the sentence “I was born and brought up in Karnataka, hence I speak fluent _______”. To predict the missing word it is enough to pay attention to “Karnataka” and say “Kannada”, rather than having to look at the entire sentence and process every word equally. Attention is a vector of importance weights that estimates how strongly one element is correlated with the other elements.

Attention is widely used in NLP and computer vision tasks such as machine translation, image caption generation, and Microsoft’s AttnGAN.

Fig 1: seq2seq model

Before attention, neural machine translation was done with an encoder-decoder or sequence-to-sequence (seq2seq) model, as shown in the figure above. The encoder produces an ‘encoder vector’ or ‘context vector’, which is a summary of everything it has seen so far (its last hidden state), and this becomes the initial hidden state of the decoder. The main drawback of this setup is that if the encoder produces a bad summary, the decoder output will be wrong. This tends to happen when the input sequence is very long, and it is called the long-range dependency problem.
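To make this bottleneck concrete, here is a minimal NumPy sketch of a seq2seq model without attention. The plain tanh RNN cells, the parameter names and all sizes are illustrative assumptions, not the exact architecture of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                      # illustrative embedding and hidden sizes

# Hypothetical parameters of plain tanh RNN encoder and decoder cells
W_xe, W_he = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
W_xd, W_hd = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

def rnn_step(x, h, W_x, W_h):
    """One recurrent step: new hidden state from input x and previous state h."""
    return np.tanh(W_x @ x + W_h @ h)

# Encoder: read the whole input; only the LAST hidden state is kept
inputs = [rng.normal(size=d_in) for _ in range(4)]    # e.g. "Hello I am Dhruv"
h = np.zeros(d_h)
for x in inputs:
    h = rnn_step(x, h, W_xe, W_he)
context = h                            # the single fixed-size summary

# Decoder: starts from that one vector, with no further access to the inputs
s = context
y_prev = np.zeros(d_in)                # embedding of the previously emitted word
for t in range(4):
    s = rnn_step(y_prev, s, W_xd, W_hd)
    # ...project s onto the output vocabulary, pick a word, embed it as y_prev...
```

Everything the decoder ever learns about the source sentence has to squeeze through the single `context` vector, and that is exactly the constraint attention removes.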

How does attention work?

Attention was proposed as a solution to this problem with seq2seq models. Instead of relying on a single fixed summary, the context vector gets access to the entire input sequence through shortcut connections between the context vector and each input position, and the weights of these shortcuts are different for each decoder step.

Fig 2: Encoder-Decoder model with attention mechanism

The figure above shows an encoder-decoder model with the attention mechanism, as in Bahdanau’s 2015 paper on attention.

Let’s say we have to translate a sentence from English to Hindi,

“Hello I am Dhruv” to “नमस्ते मैं ध्रुव हूँ”.

Here the input sequence x has length 4 and the output sequence y also has length 4.

The encoder in Fig 2 is a bidirectional RNN with both forward and backward hidden states. Each vector h1, h2, …, used in the paper is the concatenation of the forward and backward hidden states at position i in the encoder, where i = 1, 2, 3, 4.
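A sketch of what this concatenation looks like, again in NumPy with toy tanh RNN cells (the parameter names and sizes are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 8, 16

# Hypothetical forward and backward RNN parameters
W_xf, W_hf = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
W_xb, W_hb = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

def rnn_step(x, h, W_x, W_h):
    return np.tanh(W_x @ x + W_h @ h)

inputs = [rng.normal(size=d_in) for _ in range(4)]    # "Hello I am Dhruv"

# Forward pass: left to right
h, forward = np.zeros(d_h), []
for x in inputs:
    h = rnn_step(x, h, W_xf, W_hf)
    forward.append(h)

# Backward pass: right to left
h, backward = np.zeros(d_h), []
for x in reversed(inputs):
    h = rnn_step(x, h, W_xb, W_hb)
    backward.append(h)
backward.reverse()

# h_i = [forward_i ; backward_i]  -> one annotation per input word
H = np.stack([np.concatenate([f, b]) for f, b in zip(forward, backward)])
print(H.shape)    # (4, 32): four positions, each of dimension 2 * d_h
```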

The hidden state of the decoder network at step t is given by

s_t = f(s_{t-1}, y_{t-1}, c_t)

where c_t is the context vector: a weighted sum of the hidden states of the input sequence, weighted by the alignment scores,

c_t = Σ_i α_{t,i} h_i
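As a small sketch, the context vector is just a weighted average of the encoder annotations. The `alpha_t` values below are made-up placeholders; how they are actually computed is covered next.

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.normal(size=(4, 32))    # stand-in for the encoder annotations h_1..h_4

# Alignment weights for one decoder step t: one weight per input position,
# summing to 1 (they come out of the softmax over alignment scores, shown below)
alpha_t = np.array([0.85, 0.07, 0.05, 0.03])    # made-up illustrative values

# c_t = sum_i alpha_{t,i} * h_i  -> a different context vector at every step
c_t = alpha_t @ H               # shape (32,)
```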

The align function assigns a score α_{t,i} that tells how important the input at position i is to the output at position t. Say the first decoder unit has to convert “Hello” to “नमस्ते”; its context vector could be

c_1 = α_{1,1} h_1 + α_{1,2} h_2 + α_{1,3} h_3 + α_{1,4} h_4

with alignment scores such that α_{1,1} is the largest, i.e. the importance of the first hidden vector (“Hello”) is highest. This means that when converting “Hello” to “नमस्ते” the network pays more attention to “Hello” and less attention to the rest of the sequence.

In the original paper the alignment score is calculated with a feed-forward network with a single hidden layer, and the score function is given by

score(s_t, h_i) = v_aᵀ tanh(W_a [s_t ; h_i])

where v_a and W_a are weights to be learned.
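A minimal NumPy sketch of this scoring step, using the additive/concat form of the score above; `W_a`, `v_a`, all sizes and the random initialisation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d_h, d_s, d_a = 32, 16, 20      # annotation, decoder state and attention sizes

H = rng.normal(size=(4, d_h))   # encoder annotations h_1..h_4
s_prev = rng.normal(size=d_s)   # previous decoder hidden state s_{t-1}

# Parameters of the single-hidden-layer scoring network (random stand-ins here)
W_a = 0.1 * rng.normal(size=(d_a, d_s + d_h))
v_a = 0.1 * rng.normal(size=d_a)

def score(s, h):
    """Additive (concat) score: v_a^T tanh(W_a [s ; h])."""
    return v_a @ np.tanh(W_a @ np.concatenate([s, h]))

# One score per input position, then a softmax to turn them into weights
e_t = np.array([score(s_prev, h_i) for h_i in H])
alpha_t = np.exp(e_t - e_t.max())
alpha_t /= alpha_t.sum()        # alignment weights, non-negative and sum to 1

# Context vector for this decoder step: weighted sum of the annotations
c_t = alpha_t @ H
print(alpha_t, c_t.shape)
```

During training, v_a and W_a are learned jointly with the rest of the network, so the softmax learns to put most of its mass on the source positions that matter for the current output word.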
