From Seq2Seq to Attention: Revolutionizing Sequence Modeling

Zain ul Abideen
6 min read · Jun 26, 2023


Investigating the origin of the attention mechanism and Bahdanau attention

Introduction

In this blog post, I will discuss the origin of the attention mechanism and then cover the first paper that used attention for neural machine translation. In the previous blog post, Demystifying Sequence Modeling: Understanding RNNs, LSTMs, and Seq2Seq, I discussed the basics of sequence modeling and architectures like RNN, LSTM, and Seq2Seq. This post builds on that material, so if you haven’t read the previous one yet, go check it out first. The Seq2Seq model with two RNNs falls short because of context compression, short-term memory limitations, and exposure bias. Its BLEU score keeps decreasing as sequence length increases.

The BLEU scores of the generated translations with respect to the lengths of the sentences

The diagram above shows that the Seq2Seq model with two RNNs degrades drastically as sentence length increases: it cannot capture all the relevant information in long sequences. This problem gave birth to the attention mechanism. In fact, the idea of attention is much older; what changed is that we learned how to express it mathematically and use it for machine translation.

Origin of Attention Mechanism

If we set everything else aside and focus on how our eyes work, we can easily find the origin of the attention mechanism. We can see multiple objects in front of us, but we focus on one object at a time. This is our attention cue. We give more importance to a few sensory inputs and less importance to others. We select the spotlight of attention using nonvolitional and volitional cues. The nonvolitional cue is based on the saliency and conspicuity of objects in the environment. The volitional cue, based on variable selection criteria, makes this form of attention more deliberate; it is also more powerful because it involves the subject’s voluntary effort.

Using the volitional cue (want to read a book) that is task-dependent, attention is directed to the book under volitional control.

Queries, Keys, and Values

Let me introduce the concept of queries, keys, and values. In the context of the attention mechanism, we refer to volitional cues as queries. Given any query, attention mechanisms bias selection over sensory inputs via attention pooling. These sensory inputs are called values in the context of attention mechanisms. More generally, every value is paired with a key, which can be thought of as the nonvolitional cue of that sensory input.

Attention mechanisms bias selection over values (sensory inputs) via attention pooling, guided by queries (volitional cues) and keys (nonvolitional cues)

Attention Pooling

Attention pooling refers to the process of aggregating the values into a single output using the attention weights produced by the attention mechanism. The attention scoring function is used to assign weights or scores to different parts of the input sequence based on their relevance to the current decoding step.

Computing the output of attention pooling

The mechanism illustrated in the figure above works as follows: for a specific query, we calculate its relevance with respect to all the keys using an attention scoring function. Then we apply the softmax operation to obtain a probability distribution (the attention weights). Finally, we compute the weighted sum of the values using these attention weights.
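To make these three steps concrete, here is a minimal NumPy sketch of attention pooling. The function and variable names are my own and purely illustrative; any scoring function can be plugged in.

```python
import numpy as np

def attention_pooling(query, keys, values, score_fn):
    # Step 1: score the query against every key with the scoring function.
    scores = np.array([score_fn(query, k) for k in keys])   # shape: (num_keys,)
    # Step 2: softmax turns the scores into attention weights (a probability distribution).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Step 3: the output is the weighted sum of the values.
    return weights @ values, weights

# Toy usage with a simple dot-product score, just to show the flow.
keys = np.random.randn(5, 8)      # 5 keys, each of dimension 8
values = np.random.randn(5, 16)   # 5 matching values, each of dimension 16
query = np.random.randn(8)
output, weights = attention_pooling(query, keys, values, lambda q, k: q @ k)
```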

Attention Scoring Functions

There are different types of attention scoring functions: additive attention, multiplicative attention, and scaled dot-product attention. Bahdanau attention uses additive attention as its scoring function, so that is what I will discuss here. Scaled dot-product attention will be explained in the next blog post, which covers the ‘Attention Is All You Need’ paper. When queries and keys are vectors of different lengths, we use additive attention as the scoring function.

Additive attention

Given a query q and a key k, the additive attention scoring function first projects them with learnable matrices W_q and W_k and adds the results (equivalently, it concatenates q and k and feeds them into an MLP with a single hidden layer whose number of hidden units, h, is a hyperparameter). Tanh is used as the activation function and bias terms are disabled, giving the score a(q, k) = w_v^T tanh(W_q q + W_k k), where w_v is a learnable output vector.
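Here is a minimal PyTorch sketch of this scoring function. The class and attribute names are my own; I assume separate projection matrices W_q and W_k plus an output vector w_v, following the formula above.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Sketch of additive attention: a(q, k) = w_v^T tanh(W_q q + W_k k), biases disabled."""
    def __init__(self, query_dim, key_dim, num_hiddens):
        super().__init__()
        self.W_q = nn.Linear(query_dim, num_hiddens, bias=False)
        self.W_k = nn.Linear(key_dim, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)

    def forward(self, queries, keys, values):
        # queries: (batch, num_queries, query_dim); keys/values: (batch, num_kv, dim)
        q = self.W_q(queries).unsqueeze(2)                 # (batch, num_queries, 1, h)
        k = self.W_k(keys).unsqueeze(1)                    # (batch, 1, num_kv, h)
        scores = self.w_v(torch.tanh(q + k)).squeeze(-1)   # (batch, num_queries, num_kv)
        weights = torch.softmax(scores, dim=-1)            # attention weights
        return torch.bmm(weights, values), weights         # weighted sum of values
```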

Bahdanau Attention

Bahdanau attention, with its additive attention formulation, emerged as a powerful and widely adopted attention mechanism. It provides the flexibility to capture complex alignments between the decoder and encoder states, enabling models to generate more accurate and contextually aware sequences. This architecture allows the model to automatically (soft-)search for the parts of a source sentence that are relevant to predicting a target word. It assigns an attention weight to each source word so the model knows how much “attention” to pay to it (i.e., for each target word, the network learns a context).

Generating the t-th target word y(t) given a source sentence (x1, x2,…,xT )

The Bahdanau attention mechanism consists of three main components: the encoder, the decoder, and the attention scoring function. The encoder is a bidirectional RNN and the decoder is a unidirectional RNN. A bidirectional recurrent neural network (BRNN) is a type of RNN architecture that processes input sequences in both forward and backward directions. It combines information from past and future contexts to make predictions or generate output at each time step, enabling the model to capture dependencies in both directions. In the figure above, the hidden states of the BRNN are denoted h(t) and the hidden states of the unidirectional decoder RNN are denoted s(t).
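As a rough illustration, the encoder side could look like the following PyTorch sketch with a bidirectional GRU. The class and argument names are assumptions of mine, not the paper’s code.

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Sketch of a Bahdanau-style encoder: a bidirectional GRU whose
    hidden states h(t) the decoder will later attend over."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) of token ids
        embedded = self.embedding(src_tokens)      # (batch, src_len, embed_dim)
        # Each output position concatenates the forward and backward hidden state,
        # so every h(t) summarizes both past and future context: (batch, src_len, 2*hidden_dim)
        outputs, final_hidden = self.rnn(embedded)
        return outputs, final_hidden
```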

Attention weights

The attention weights a(t,T) represent the relevance of each encoder hidden state to the current decoding step. These attention scores quantify how much attention should be given to each part of the input sequence. They are calculated by another feed-forward network (the alignment model), which takes the encoder hidden states and the decoder’s previous hidden state as input and outputs a scalar energy e. The attention scores are then normalized using a softmax function, transforming them into a probability distribution. The softmax ensures that the attention scores sum to one, allowing them to be interpreted as weights or probabilities.

Calculation of Context Vector

In the figure above, the context vector is calculated as the weighted sum of the encoder’s hidden states, with the attention scores as weights. This context vector is then fed into the decoder: it is combined with the decoder’s previous hidden state, and this combined representation is used as input for generating the next output token.
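Putting the pieces together, one possible decoding step is sketched below. This is an illustrative simplification, not the paper’s actual implementation: the previous decoder state is scored against all encoder states, the context vector is formed as their weighted sum, and it is concatenated with the embedded previous target token before being fed into the decoder RNN. The `attention` argument is any module with the interface of the AdditiveAttention sketch above.

```python
import torch
import torch.nn as nn

class BahdanauDecoderStep(nn.Module):
    """Illustrative single decoding step (names and shapes are my own sketch)."""
    def __init__(self, vocab_size, embed_dim, enc_dim, dec_dim, attention):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.attention = attention                        # e.g. the AdditiveAttention sketched earlier
        self.rnn = nn.GRUCell(embed_dim + enc_dim, dec_dim)
        self.out = nn.Linear(dec_dim, vocab_size)

    def forward(self, prev_token, prev_state, encoder_outputs):
        # Score the previous decoder state s(t-1) against all encoder states h(1..T)
        # and take their weighted sum as the context vector.
        query = prev_state.unsqueeze(1)                                   # (batch, 1, dec_dim)
        context, weights = self.attention(query, encoder_outputs, encoder_outputs)
        # Concatenate the context vector with the embedded previous target token
        # and feed the combination, together with the previous state, into the decoder RNN.
        rnn_input = torch.cat([self.embedding(prev_token), context.squeeze(1)], dim=-1)
        state = self.rnn(rnn_input, prev_state)
        return self.out(state), state, weights.squeeze(1)                 # logits, new state, a(t, .)
```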

Visualization of Attention weights in translation from English to French

Closing Remarks

In conclusion, the origin of attention mechanisms and the introduction of Bahdanau attention have revolutionized the field of sequence modeling and natural language processing. The concept of attention, inspired by human cognitive processes, has allowed neural networks to focus on relevant parts of the input sequence and make informed decisions during sequence generation tasks. The journey from the early days of attention mechanisms to the breakthroughs brought by Bahdanau attention has paved the way for advancements in machine translation, text summarization, speech recognition, and other sequence-based tasks. In the next blog post, I will cover in detail one of the most influential papers of recent times, “Attention Is All You Need” (Vaswani et al.).

Thank you for reading!

Follow me on LinkedIn!
