Published in


A Brief Overview of Attention Mechanism

What is Attention?

Before Attention mechanism, translation relies on reading a complete sentence and compress all information into a fixed-length vector, as you can image, a sentence with hundreds of words represented by several words will surely lead to information loss, inadequate translation, etc.

However, attention partially fixes this problem. It allows machine translator to look over all the information the original sentence holds, then generate the proper word according to current word it works on and the context. It can even allow translator to zoom in or out (focus on local or global features).

Attention is not mysterious or complex. It is just an interface formulated by parameters and delicate math. You could plug it anywhere you find it suitable, and potentially, the result may be enhanced.

Why Attention?

Vanilla RNN (the classic one) often gets trapped when modeling:

  1. Structure Dilemma: in real world, the length of outputs and inputs can be totally different, while Vanilla RNN can only handle fixed-length problem which is difficult for the alignment. Consider an EN-FR translation examples: “he doesn’t like apples” → “Il n’aime pas les pommes”.
  2. Mathematical Nature: it suffers from Gradient Vanishing/Exploding which means it is hard to train when sentences are long enough (maybe at most 4 words).

Translation often requires arbitrary input length and out put length, to deal with the deficits above, encoder-decoder model is adopted and basic RNN cell is changed to GRU or LSTM cell, hyperbolic tangent activation is replaced by ReLU. We use GRU cell here.

Embedding layer maps discrete words into dense vectors for computational efficiency. Then embedded word vectors are fed into encoder, aka GRU cells sequentially. What happened during encoding? Information flows from left to right and each word vector is learned according to not only current input but also all previous words. When the sentence is completely read, encoder generates an output and a hidden state at timestep 4 for further processing. For encoding part, decoder (GRUs as well) grabs the hidden state from encoder, trained by teacher forcing (a mode that previous cell’s output as current input), then generate translation words sequentially.

It seems amazing as this model can be applied to N-to-M sequence, yet there still is one main deficit left unsolved: is one hidden state really enough?

Yes, Attention here.

How does attention work?

Similar to the basic encoder-decoder architecture, this fancy mechanism plug a context vector into the gap between encoder and decoder. According to the schematic above, blue represents encoder and red represents decoder; and we could see that context vector takes all cells’ outputs as input to compute the probability distribution of source language words for each single word decoder wants to generate. By utilizing this mechanism, it is possible for decoder to capture somewhat global information rather than solely to infer based on one hidden state.

And to build context vector is fairly simple. For a fixed target word, first, we loop over all encoders’ states to compare target and source states to generate scores for each state in encoders. Then we could use softmax to normalize all scores, which generates the probability distribution conditioned on target states. At last, the weights are introduced to make context vector easy to train. That’s it. Math is shown below:

To understand the seemingly complicated math, we need to keep three key points in mind:

  1. During decoding, context vectors are computed for every output word. So we will have a 2D matrix whose size is # of target words multiplied by # of source words. Equation (1) demonstrates how to compute a single value given one target word and a set of source word.
  2. Once context vector is computed, attention vector could be computed by context vector, target word, and attention function f.
  3. We need attention mechanism to be trainable. According to equation (4), both styles offer the trainable weights (W in Luong’s, W1 and W2 in Bahdanau’s). Thus, different styles may result in different performance.


[1] Vinyals, Oriol, et al. Show and tell: A neural image caption generator. arXiv:1411.4555 (2014).
[2] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 (2014).
[3] Cho, Kyunghyun, Aaron Courville, and Yoshua Bengio. Describing Multimedia Content using Attention-based Encoder–Decoder Networks. arXiv:1507.01053 (2015)
[4] Xu, Kelvin, et al. Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044 (2015).
[5] Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. End-to-end memory networks. Advances in Neural Information Processing Systems. (2015).
[6] Joulin, Armand, and Tomas Mikolov. Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets. arXiv:1503.01007 (2015).
[7] Hermann, Karl Moritz, et al. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems. (2015).
[8] Raffel, Colin, and Daniel PW Ellis. Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems. arXiv:1512.08756 (2015).
[9] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., & Gomez, A. et al. . Attention Is All You Need. arXiv: 1706.03762 (2017).

Tech Analyst: Qingtong Wu



We produce professional, authoritative, and thought-provoking content relating to artificial intelligence, machine intelligence, emerging technologies and industrial insights.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store

AI Technology & Industry Review — | Newsletter: | Share My Research | Twitter: @Synced_Global