Attention Model

Dharti Dhami
Dec 21, 2018

We looked at the Encoder-Decoder (seq2seq) architecture for machine translation, where one RNN reads in a sentence and then a different one outputs a translation of the sentence.

If we have a very long French sentence, the encoder/decoder setup asks the network to read in the whole sentence, memorize it in its activations and then generate the English translation. The Encoder-Decoder architecture works quite well for short sentences, but for very long sentences (30 to 40 words) the performance goes down. If you think of how a human translator would handle a long sentence, they would typically generate the translation part by part, because it’s just really difficult to memorize the whole sentence.

There’s a modification to the seq2seq model called the Attention Model that makes all of this work much better. The attention idea has been one of the most influential ideas in deep learning. Even though the Attention Model was developed for machine translation, it has spread to many other application areas like image captioning.

What the Attention Model computes is a set of attention weights (one set per output step) that denote how much attention to give to each word in the input when generating the current output word. So for generating the first word of the output translation we probably need to look at the first input word and maybe a few words close to it, and the weights for that step would be constructed so that those words contribute the most when generating the translation.
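As a rough numeric illustration (the values below are made up, not from the post), the attention weights for one output step form a distribution over the input words:

```python
import numpy as np

# Hypothetical attention weights alpha<1, t'> for generating the first
# output word from a 5-word input (values chosen only for illustration).
alpha_1 = np.array([0.45, 0.30, 0.05, 0.15, 0.05])

# The weights form a distribution over the input words: non-negative,
# summing to 1, and here concentrated on the first couple of input words.
assert np.isclose(alpha_1.sum(), 1.0)
```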

Model

The Attention Model uses 2 RNNs. The first is a bidirectional RNN (with GRU or LSTM cells) that computes features for every input word. So for the forward recurrence we have an activation at each time step, and the same for the backward recurrence.

Here we use t_prime to index into the words of the French sentence. So a<t_prime> is the feature vector for time step t_prime, consisting of both the forward and backward activations at that time step. These activations are used as features for the second RNN.
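A minimal NumPy sketch of this feature construction, with made-up shapes: the feature for each word is just its forward and backward activations stacked together.

```python
import numpy as np

Tx, hidden = 6, 4                         # 6 input words, hidden size 4 (arbitrary)
a_forward = np.random.randn(Tx, hidden)   # stand-in for the forward RNN activations
a_backward = np.random.randn(Tx, hidden)  # stand-in for the backward RNN activations

# The feature a<t_prime> concatenates the forward and backward activations.
a = np.concatenate([a_forward, a_backward], axis=1)  # shape (Tx, 2 * hidden)
print(a.shape)  # (6, 8)
```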

The second model is a single-direction RNN that generates the translation one word at a time. For the first time step, it takes an input activation (let’s call this S<0>) and a context C<1> to generate y<1>. The context C depends on the attention weights applied to the activations of the bidirectional RNN.

The second network is a pretty standard RNN sequence that takes the context vectors as input and generates the translation one word at a time.
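In other words, the context for decoder step t is a weighted sum of the bidirectional features: c<t> is the sum over t_prime of alpha<t, t_prime> times a<t_prime>. A small NumPy sketch, with assumed shapes and made-up weights:

```python
import numpy as np

Tx, feat = 6, 8
a = np.random.randn(Tx, feat)  # bidirectional features a<t'> from the first RNN

# Attention weights for decoder step t (made-up values that sum to 1).
alpha_t = np.array([0.45, 0.30, 0.05, 0.10, 0.05, 0.05])

# c<t> = sum over t' of alpha<t, t'> * a<t'>
c_t = (alpha_t[:, None] * a).sum(axis=0)  # shape (feat,)
```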

How do we compute the attention weights?

One way to do so is to use a small neural network to calculate a score e<t, t_prime>. The inputs to this network are s<t-1>, the decoder state from the previous time step, and a<t_prime>. For each output step t, the scores are then passed through a softmax so that the attention weights alpha<t, t_prime> are positive and sum to 1.
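Here is a minimal sketch of that scoring step (the one-hidden-layer network, the tanh nonlinearity and all the shapes are assumptions, not from the post): it scores each a<t_prime> against s<t-1> and applies a softmax over the scores to get the attention weights.

```python
import numpy as np

def attention_weights(s_prev, a, W, v):
    """Compute alpha<t, t'> for one decoder step.

    s_prev : previous decoder state s<t-1>, shape (ds,)
    a      : encoder features a<t'>, shape (Tx, da)
    W, v   : parameters of the small scoring network (assumed shapes)
    """
    Tx = a.shape[0]
    # e<t, t'> = small network applied to [s<t-1>, a<t'>]
    e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a[t_prime]]))
                  for t_prime in range(Tx)])
    # Softmax so the weights are positive and sum to 1.
    exp_e = np.exp(e - e.max())
    return exp_e / exp_e.sum()

# Toy usage with made-up sizes.
ds, da, Tx, hidden = 4, 8, 6, 5
rng = np.random.default_rng(0)
alpha = attention_weights(rng.standard_normal(ds),
                          rng.standard_normal((Tx, da)),
                          rng.standard_normal((hidden, ds + da)),
                          rng.standard_normal(hidden))
print(alpha.sum())  # 1.0
```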

One downside to this algorithm is that it takes quadratic time to run: if we have Tx words in the input and Ty words in the output, then the total number of attention weights is Tx times Ty. In machine translation applications, though, where neither the input nor the output sentence is usually that long, the quadratic cost is actually acceptable.
