Attention Model has been a rising star and a powerful model in the deep learning in recent years. Especially the concept of self attention proposed by Google in 2017, which can be applied to many fields as long as related to sequence. However, the knowledge gap between attention model and self attention has not been well explained, leading to the entry barrier for self attention. In this article, I will try to integrate the evolution of Machine Translation from Seq2seq to attention model and to self attention.

This article is split into two parts. In the First part Seq2seq and Attention model are the main topic, whereas Self Attention will be in the second part.

Hope you enjoy it.

Fig. 1. Recommended paper for understanding the evolution of the model from Seq2seq to Attention model and to Self Attention.


You might have heard Sequence-to-sequence(Seq2seq) model before, which is one of the most popular models nowadays. Seq2Seq is widely used for tasks such as machine translation, Chatbot/Q&A, caption generation and other cases where it is desirable to produce a sequence from another.

Let us take a look at some examples.

Fig. 2. Some applications of Seq2seq models, including machine translation, chat-bot, and caption generation.

Since machine translation is in general use of Seq2seq model, we will use it as an instance in the upcoming paragraph.

Figure(3) is a classic Seq2seq model, which consists of Encoder and Decoder. Once inputing source into the encoder, you will get the target from the decoder.

For example, if we choose “Are you very big? ”as a source sentence, we will get “你很大?”as a target sentence. This is how machine translation works. However, looking at the formula on the right, we will find it’s easier said than done if we have little preliminary knowledge about RNN/LSTM.

Fig. 3 (left). The encoder-decoder model, translating the sentence “Are you very big?” to Chinese “你很大?”. Fig. 3 (right). The decoder formula, which is computed by hidden state and context vector.

Never mind, let’s get the ball rolling.


Unlike feedforward neural networks, RNN can use memory to remember everything when processing sequential data. However, too much memory would bring damage when training a RNN model. For example, it’s possible for the gap between the relevant information and the target where it is focused to become very large, leading to exploding/vanishing gradients — figure(4).

Fig. 4. (left). The RNN model, often facing gradient exploding/vanishing with much memory. Fig. 4 (right). The LSTM model, which consists of forget/input/output gate, cell state and hidden state.

To prevent the weakness of RNN, LSTM came up as a savior with some methods such as forget/input/output gate, cell memory, hidden state, etc. In 2014, paper Learning Phrase Representations also propose a new model called GRU, which is much simpler to compute and implement. LSTM, GRU are two particular type of recurrent neural networks that got lots of attention within the deep learning community.

These are some applications of LSTM models in Figure(5). However, we will not dive into them in this article.

Fig. 5. The applications of LSTM models.


Fig. 6(left). Same as Fig.3(left). Fig. 6(right). The decoder formula are similar to RNN formula with context vector added.

Back to the topic. In Figure(6), It’s obvious that Seq2seq model consists of Encoder and Decoder. Once inputing source sentence into the encoder, you will get the target sentence from the decoder. With the preliminary knowledge of RNN, you can find the formula of Seq2seq is similar to RNN, but with a new item called context vector. Context vector is also known as thought vector with sentence embedding information inside, which is the final hidden state of the encoder. An encoder processes the input sequence and compresses the information into a context vector of a fixed length. This representation is expected to be a good summary of the meaning of the whole source sentence. The decoder is then initialized with the context vector to generate output — figure(7).

Fig. 7(left). Detail of the encoder-decoder model. Fig. 7(right). Calculation of the Context vector and output sentence.

However, In seq2seq model, source information is compressed in a fixed-length context vector. A critical disadvantage of this design is incapability of long sentences memory and leads to bad result.

Fig. 8. Four-frame cartoons of the problems in encoder-deocder model.

In 2015, the attention model was born to resolve this problem.

Attention model

Why attention model?

The attention model was born to help memorize long source sentences in neural machine translation.

Instead of building a single context vector out of the encoder’s last hidden state, this new architecture create context vector for each input word. For example, If a source sentence has N input word, then there’s get to be with N context vectors rather than one. The advantage is that the encoded information will be much great decoded by the model.

Fig. 9(left). The attention model. Fig. 9(right). Detail of the attention model..

Here, the Encoder generates <h1,h2,h….hm> from the source sentence <X1,X2,X3…Xm>. So far, the architecture remains the same as the Seq2seq model. Source sentence as input and Target sentence as output. Therefore, we want to figure out the difference inside the model, which is the context vector c_{i} for each of the input word in Figure(10).

The question is how to calculate the context vector c_{i}.

Fig. 10(left). The decoder formula, coming with new idea like context vector and attention score for each input sequence. Fig. 10(right). A brief picture of attention score.
equation. 1 Context vector

Context vector c_{i} is a sum of hidden states of the input sequence, weighted by attention scores α — equation(1). Attention/Alignment score is the most important idea in attention model for interpreting word importance. Since attention score is calculated by score e_{ij}, let’s talked about score e_{ij} first.

equation. 2 Attention score
equation. 3 score

a is an Alignment model that assigns a score e_{i,j} to the pair of input at position j and output at position i, based on how well they fit. e_{i,j} are weights defining how much of RNN decoder hidden state s_{i-1} and the j-th annotation h_{j} of the source sentence should be considered for each output — equation(3).

Fig. 11. Global attention proposed in Effective approaches of the NMT.

score e_{ij}, later has been interpreted as part of global attention and soft attention in the paper Effective approaches of the NMT and Show, Attend and Tell. We can transform the score e_{ij} into score(h_{t}, \bar {h_{s}}), where h_{t} maps s_{i-1} and \bar {h_{s}} maps h_{j} from above — Figure(11). As you can see, there are many ways for calculating the score. We can simply take dot for easy use.

Once you have the score e_{ij}, the attention score α is then parametrized by a feed-forward network , alignment model a, calculated with softmax — equation(2) and context vector is done. This network is jointly trained with other parts of the model. With the matrix of attention scores, it is nice to see the correlation between source and target words.

Fig. 12. The x-axis and y-axis of each plot correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the weight αij of the annotation of the j -th source word for the i -th target word , in grayscale (0 : black, 1 : white).

Finally, the encoder uses Bi-directional RNN for better understandings of the context from future time steps. Take the following examples, “She said, The rock is a good movie” and “She said, The rock is a great actor”. In the above two sentences, when we are looking at the word “The rock” and the previous two words “She said”, we might not be able to understand if the sentence refers to the movie or actor. Therefore, we need to look ahead to resolve this ambiguity. This is how Bidirectional RNN works.

Fig. 13(above). The difference between RNN and Bi-directional RNN. Fig. 13(below). Bi-directional RNN in attention model.

However, Attenion model still faces some problems. One disadvantage is that the context vector is calucated through hidden state between source and target sentence, leaving the attention inside source sentence and target sentence itself ignored — see Figure(11). Another is that RNN is hard to parallelize, leading the calculation time-consuming. Some paper trying to resolve the problem through CNN model. In fact, CNN model uses lots of layers to minimize the weakness of local information. Because of these shortcomings, Google introduced self attention to resolve them in 2017.

Fig. 14. Four-frame cartoons of the problems in attention-based model

Let’s recap a classic Seq2seq model & Attention-based model in Fig.15. The difference between them is the calculation of context vector.

Fig. 15(left). The encoder-decoder model. Fig. 15(right). The attention-based model.


So far, we have a a brief overview of the Seq2seq model and Attention-based model. I strongly recommend you to read those classic paper for a deeper understanding.

In the next part, I will be focusing on self attention, proposed by Google in the paper “Attention is all you need”. The model proposed by this paper abandoned the architecture of RNN and CNN.

Seq2seq pay Attention to Self Attention: Part 2


[1] Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translationr. arXiv:1406.1078v3 (2014).

[2] Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215v3 (2014).

[3] Neural machine translation by joint learning to align and translate. arXiv:1409.0473v7 (2016).

[4] Effective Approaches to Attention-based Neural Machine Translation. arXiv:1508.0402v5 (2015).

[5] Convolutional Sequence to Sequence learning. arXiv:1705.03122v3(2017).

[6] Attention Is All You Need. arXiv:1706.03762v5 (2017).

[7] ATRank: An Attention-Based User Behavior Modeling Framework for Recommendation. arXiv:1711.06632v2 (2017).

[8] Key-Value Memory Networks for Directly Reading Documents. arXiv:1606.03126v2 (2016).

[9] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044v3 (2016).

[10] Deep Residual Learning for Image Recognition. arXiv:1512.03385v1 (2015).

[11] Layer Normalization. arXiv:1607.06450v1 (2016).

Ta-Chun (Bgg) Su

Written by

Data Scientist/Analyst @ Cathay Financial Holdings AI Lab.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade