How Does Attention Work in Neural Networks?

Kit Yeung
3 min read · Sep 19, 2017


This is a summary of https://distill.pub/2016/augmented-rnns/.

Neural Turing Machine

The Neural Turing Machine (NTM) extends an RNN with an external memory that it can read from and write to. Both the read and write operations are based on an attention mechanism, which determines which memory locations to read and write.

To read from memory, an attention vector, whose size equals the number of memory locations, specifies how much should be extracted from each location. It is a bit like how humans allocate attention: when we do math, most of our attention goes to logical reasoning, and when we deal with language, we focus on our linguistic faculties. The final read-out vector is the sum of the memory vectors weighted by the attention vector.
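Here is a minimal sketch of such an attention-weighted read, assuming a memory matrix M with N locations of width W and an attention vector w that sums to 1. The variable names are illustrative, not from the paper.

```python
import numpy as np

N, W = 8, 4                      # number of memory locations, width of each
M = np.random.randn(N, W)        # external memory
w = np.full(N, 1.0 / N)          # attention over locations (uniform here)

# The read-out is the attention-weighted sum of the memory rows.
read_vector = w @ M              # shape (W,)
```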

The write operation is similar: the attention vector determines how strongly each memory location is updated, blending its old content with the new value being written.
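A minimal sketch of such a write, following the simplified description in the distill article: every location blends its old content with the new write value in proportion to its attention weight. (The NTM paper itself uses separate erase and add vectors; this is the simplified view, with illustrative variable names.)

```python
import numpy as np

N, W = 8, 4
M = np.random.randn(N, W)        # external memory
w = np.zeros(N); w[2] = 1.0      # attention focused entirely on location 2
write_value = np.ones(W)         # vector we want to write

# Convex combination of old memory and the write value, per location.
M = (1 - w)[:, None] * M + w[:, None] * write_value
```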

This raises a question: how does the NTM know which positions in memory to focus its attention on? It actually uses a combination of two methods: content-based attention and location-based attention. Content-based attention allows NTMs to search through their memory and focus on places that match what they’re looking for.

A controller produces a query vector (like asking how to do 1+1=?), which is compared against each memory location (like one storing integer addition) to form a similarity vector. The similarity vector then goes through a softmax operation and becomes a vector indicating how likely each cell is to be related to the query. This is essentially the attention vector. We can go a step further and interpolate it with the attention from the previous step; in the diagram below, the interpolation weight on the previous attention is near zero.
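A minimal sketch of this content-based addressing, assuming cosine similarity between the controller's query vector and each memory row (as in the NTM paper) and a simple scalar gate g for the interpolation; the names query, g, and prev_attention are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

N, W = 8, 4
M = np.random.randn(N, W)              # external memory
query = np.random.randn(W)             # what the controller is looking for
prev_attention = np.full(N, 1.0 / N)   # attention from the previous step
g = 0.9                                # close to 1: mostly ignore the past

# Cosine similarity between the query and every memory location.
sim = M @ query / (np.linalg.norm(M, axis=1) * np.linalg.norm(query) + 1e-8)

# Softmax turns similarities into a probability-like attention vector.
content_attention = softmax(sim)

# Interpolate with the previous attention (little influence when g is near 1).
attention = g * content_attention + (1 - g) * prev_attention
```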

After we find the attention vector, we can also allow relative movement in memory, which enables the NTM to loop. This is location-based attention. The diagram below shows the attention vector obtained from content-based attention in the previous step; it is convolved with a shift filter to produce the new attention vector, much like in a convolutional neural network.
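A minimal sketch of this location-based shift: the attention vector is circularly convolved with a small filter over offsets, so the focus can move forward or backward in memory. The shift filter values here are illustrative.

```python
import numpy as np

N = 8
attention = np.zeros(N); attention[2] = 1.0   # currently focused on slot 2

# Shift filter over offsets (-1, 0, +1); here all the weight is on "+1".
shift = {-1: 0.0, 0: 0.0, +1: 1.0}

# Circular convolution of the attention vector with the shift filter.
new_attention = np.zeros(N)
for i in range(N):
    for offset, weight in shift.items():
        new_attention[i] += attention[(i - offset) % N] * weight

print(new_attention)   # the focus has moved to slot 3
```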

Attentional Interface

It seems to me that the attentional interface works for the seq2seq model, which receives an input sequence A and generates an output sequence B.

One of the best and easiest ways to understand this is a translation model: you can see how each input word influences the generation of each translated word.
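A minimal sketch of such an attentional interface, assuming simple dot-product attention: at each decoding step, the decoder state is compared with every encoder state, the scores are softmaxed, and the context vector is the weighted sum of the encoder states. The variable names (encoder_states, decoder_state) are illustrative, not from any specific library.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, H = 5, 16                            # input length, hidden size
encoder_states = np.random.randn(T, H)  # one state per input word
decoder_state = np.random.randn(H)      # state while generating one output word

scores = encoder_states @ decoder_state   # how relevant each input word is
attention = softmax(scores)               # weights over the input words
context = attention @ encoder_states      # summary fed back into the decoder
```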
