What You Never Knew About Attention Mechanisms

Brendan Artley
SFU Professional Computer Science
12 min read · Feb 11, 2022
Intro Image I

Greg Mehdiyev, Ray Hong, Jinghan Yu, Brendan Artley

This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit {sfu.ca/computing/mpcs}.

Introduction

Where are your eyes drawn in this photo? Most of us will admit that our eyes are drawn to the blue duckling. To humans, the blue duckling sticks out like a sore thumb. Somehow, we have become exceptional at finding patterns and directing our attention to distinguishing features.

Why? What exactly brings our attention to the blue duck?

If you take a closer look at the photo, you can see that there are other ducks that are different. For example, there are three ducks facing sideways instead of forward.

Intro Image II

What makes the blue duck stand out more than the ducks facing sideways, and can we teach computers to learn this?

This is exactly what the Attention Mechanism aims to solve. “Attention Mechanism is also an attempt to implement the same action of selectively concentrating on a few relevant things, while ignoring others in deep neural networks.” ⁷

In a very general sense, the attention mechanism is an improvement to the encoder-decoder architecture. An encoder-decoder model uses neural networks to translate one input into another through an encoded feature representation. The attention mechanism component gives the network the ability to pay “attention” to specific features when encoding the data, and it helps mitigate the vanishing/exploding gradient problem that often affects recurrent networks processing long sequences.

Encoder-Decoder GIF

A high-level overview of the attention mechanism implementation is as follows.

  1. Assign a score to each encoder state
  • The input sequence is encoded, and these encoded parts are known as the internal (encoder) states. We can identify the relevant encoder states by assigning high scores to states that contain information worth attending to, and low scores to states that do not contain any relevant information.

2. Compute the attention weight

  • The attention weight is calculated based on the scores from the previous step.

3. Compute the context vector

  • The context vector can be thought of as an aggregated vector containing the information from steps 1 and 2 (a minimal numeric sketch of steps 1-3 follows this list).

4. Feedforward

  • The information gathered from the context vector is then fed into the encoder / decoder layer.

5. Decode

  • The Decoder then decodes the information using the Attention Mechanism.
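
To make these steps concrete, here is a minimal sketch in Python/NumPy. The arrays and dimensions are made up purely for illustration; it scores a few encoder states against a single decoder query, turns the scores into attention weights with a softmax, and aggregates them into a context vector (steps 1-3 above).

import numpy as np

# Hypothetical encoder states (4 time steps, hidden size 3) and a decoder query.
encoder_states = np.array([[0.1, 0.4, 0.2],
                           [0.9, 0.1, 0.5],
                           [0.3, 0.8, 0.7],
                           [0.6, 0.2, 0.1]])
query = np.array([0.5, 0.9, 0.3])

# Step 1: score each encoder state (simple dot-product scoring for illustration).
scores = encoder_states @ query

# Step 2: attention weights via softmax.
weights = np.exp(scores) / np.exp(scores).sum()

# Step 3: context vector = weighted sum of the encoder states.
context = weights @ encoder_states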

Now that we have a general understanding of how Attention Mechanisms work, let’s dig into some real world applications!

Common Use Cases

  1. Natural Language Processing (NLP)

Natural Language Processing is a subset of machine learning focused on giving computers the ability to interpret human language. Tools such as translation software, chat bots, and virtual assistants all come from this area of study. One of the major challenges associated with NLP is translating the context of each word in a sentence to a format that a computer understands.

NLP Image

Traditionally, this contextual information was captured using two RNNs/LSTMs arranged as an encoder and a decoder. The encoder would summarize the sentence into a fixed-length feature representation, and the decoder would translate this representation into the output sequence.

Although this approach works well with short sentences, it becomes less accurate with long ones, largely because of the vanishing/exploding gradient problem. Without an attention mechanism, this approach is less reliable on complex human language. For example, consider the following sentence.

Sentence Example

We can compare the effectiveness of a model without an attention mechanism to one with it. In the following visualizations, words with more “importance” are shown in darker text. Without the attention mechanism component, the model is affected by exploding/vanishing gradients and fails to take into account words found early in the sentence. As a result, it can miss parts of the sentence that contribute important information to the overall meaning.

Without Attention Mechanism

This is where the Attention Mechanism has its value. It is able to take an entire sentence into account when creating a context vector and give each word a relative importance regardless of the sentence length. The ‘attention’ of the model can now be focused on what are truly the most important words in the sentence. The attention of the model may look like the following.

With Attention Mechanism

2. Computer Vision

The other area of machine learning that has benefited from the Attention Mechanism is computer vision. This area focuses on automating tasks that the human visual system can do. Current applications of computer vision include object detection, image classification, and image captioning.

Computer Vision Image

Image captioning is the process of automatically generating a textual description of an image. This description should be an accurate representation of the content in the image in a legible format. Let’s dive deeper into the benefits of adding Attention Mechanism to the model architecture.

First, if we perform image captioning without the attention mechanism, our model might generate a textual description such as “a group of yellow rubber ducks”. Since we are generalizing over the entire image, this is fairly accurate. That said, there is no mention of the blue duck, which is an obvious focal point of the image. Because each area of the image is given equal importance, the blue duck does not make it into the description.

Without the Attention Mechanism - Source

On the other hand, consider the case in which we utilize Attention Mechanism. The model assigns more importance to the area of the image that contains the blue duck, and generates a description taking this information into account.

With the Attention Mechanism

The description generated would be something like “a blue rubber duck in a group of yellow rubber ducks”. Notice how the main subject in the description is the “blue duck”. As the Attention Mechanism assigns more importance to this area of the image, the resulting textual description is more accurate.

Now that we have covered some common use cases of the Attention Mechanism, let’s look at the underlying mathematics.

Mathematics

Over the years, there have been many variations of the Attention Mechanism. Three well-known versions are the Vaswani ³, Bahdanau ², and Luong ¹¹ attention mechanisms. In this article, we will be focusing on the Vaswani Attention Mechanism and the Bahdanau Attention Mechanism.

Attention Layer Image - Source

The idea behind the Attention Mechanism is to map a query and a set of key-value pairs to an output.

  1. Key / Value / Query

“The key/value/query concept is analogous to retrieval systems. For example, when you search for videos on Youtube, the search engine will map your query (text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in their database, then present you the best matched videos (values).” ⁵

The key, query and value vectors are abstractions of the embedding vectors in different subspaces, so we obtain them by multiplying the embedding E by learned weight matrices. For reference, “An embedding (vector) is a relatively low-dimensional space into which you can translate high-dimensional vectors” ⁶.

2. Output

The output is a weighted combination of the values, where the weights are obtained by applying the softmax function to the dot products of the query with the keys.

Vaswani Attention

In the Vaswani attention, the key, query and value vectors are the inputs to the encoder-decoder layer. The dimension of the keys and queries is denoted d. Given queries and keys of dimension d, we compute the dot products of each query with all the keys, divide each result by the square root of d, and finally apply the softmax function to obtain the weights on the values.

Suppose we have a sentence with four words (s1, s2, s3, s4) and we want to calculate the attention for s4. s4 relies on s3, which relies on s2, and so forth. Let’s name the query of s3 as q3 and the keys of s1, s2, s3 as k1, k2, k3. To compute the weights, we take the dot product of the query q3 with each key, divide each result by the square root of d, and apply the softmax function. The result is the following three weights, which we will call (w1, w2, w3).

W’ Calculation Image

Suppose the values of s1, s2, s3 are (v1, v2, v3). The context vector for s4 is then the dot product of (w1, w2, w3) and (v1, v2, v3), i.e. the weighted sum w1·v1 + w2·v2 + w3·v3.

Dot Product Image
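
To make the calculation above concrete, here is the same computation written out in Python/NumPy. The numbers and the dimension d = 3 are invented for illustration only.

import numpy as np

d = 3                                        # dimension of the keys and queries (assumed)
q3 = np.array([1.0, 0.0, 1.0])               # query derived from s3
K = np.array([[0.5, 0.1, 0.3],               # k1
              [0.2, 0.9, 0.4],               # k2
              [0.7, 0.3, 0.8]])              # k3
V = np.array([[1.0, 0.0],                    # v1
              [0.0, 1.0],                    # v2
              [0.5, 0.5]])                   # v3

scores = K @ q3 / np.sqrt(d)                 # dot products, scaled by sqrt(d)
w = np.exp(scores) / np.exp(scores).sum()    # softmax -> (w1, w2, w3)
context = w @ V                              # context vector for s4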

Next, let’s calculate the attention over a set of queries. The queries, keys, and values are packed together into matrices Q, K, and V, which are obtained by multiplying the combined matrix of embedding vectors by the weight matrices Wq, Wk, and Wv. This gives us an attention matrix of the following form, which can be used to apply attention in a machine learning model (Seq2Seq, image captioning, BERT, etc.).

Attention Matrix Formula
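
In equation form, this attention matrix is the scaled dot-product attention from the Vaswani paper ³:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

where d_k is the dimension of the keys and queries.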

Bahdanau Attention

The next attention mechanism variation is the Bahdanau attention, which is also known as additive attention. The main difference between the Bahdanau and Vaswani variations is that the Bahdanau variation uses an additive strategy, whereas the Vaswani variation uses a multiplicative strategy. The two implementations also use different scaling factors.

Bahdanau Attention - Source

The Bahdanau variation of the attention mechanism can be broken down into the following steps. First, the decoder hidden state from the previous time step is combined with each encoder hidden state (one for each element of the input sentence) to create alignment scores. A tanh activation keeps these scores bounded, playing a role similar to the 1/sqrt(d) scaling in the Vaswani variation. This can be represented with the following formula.

Bahdanau Score Formula
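
In equation form, the additive alignment score from Bahdanau et al. ² has the general shape

e_{t,i} = v_a^{\top}\,\tanh\!\left(W_a s_{t-1} + U_a h_i\right)

where s_{t-1} is the previous decoder hidden state, h_i is the i-th encoder hidden state, and W_a, U_a and v_a are learned parameters (the notation here is ours; the paper's symbols differ slightly).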

Afterwards, the softmax function is used to normalize the scores into weights. Then, the weights are multiplied by the hidden encoder states to get the context vector. Finally, we concatenate the context vector and the previous decoder output to generate a new output. This process is repeated over each time step.

The detailed implementation of this Attention Mechanism is included in the code section.

Additional Info

  1. What is Seq2Seq?
  • Sequence-to-Sequence learning is when a model transforms a sequence from one domain into a sequence in another, as in language translation, where a sentence is translated from one language into another.

2. Why softmax?

  • The softmax function takes an input vector of n real numbers and normalizes it into a probability distribution with n components. Each output component lies in the range (0, 1), and the components sum to 1. With the softmax function, we can therefore generate a probability distribution describing how likely each component is to influence the decoder output. Finally, the context vector is concatenated with the previous decoder output and fed into the decoder RNN cell to produce a new hidden state.

3. Why scaling?

  • If the dimension d is large, the dot products can grow large in magnitude, pushing the softmax function into regions where its gradients are very small. We scale the dot products to reduce this effect, as illustrated below. Tanh plays this role in the Bahdanau variation, whereas 1/sqrt(d) is used in the Vaswani variation.
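
A quick sketch with made-up numbers shows why the scaling matters: once the scores grow large, the softmax output becomes nearly one-hot and its gradients all but vanish.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())              # subtract the max for numerical stability
    return e / e.sum()

small = np.array([1.0, 2.0, 3.0])
large = small * 30                       # what unscaled dot products can look like when d is large

print(softmax(small))                    # fairly spread out: roughly [0.09, 0.24, 0.67]
print(softmax(large))                    # nearly one-hot, so gradients w.r.t. the scores vanish
print(softmax(large / np.sqrt(900)))     # scaling by sqrt(d) (here d = 900) restores a softer distribution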

Code

In the following section, we apply the Bahdanau attention mechanism to a sequence-to-sequence task. We implement an encoder-decoder architecture using Keras. You can download the complete code for this example from this Google Colab Notebook.

Our code is a modified and optimized version of the third-party implementation of the attention mechanism found in the Attention Mechanism Article ⁹. Unlike that article, we use the Attention Mechanism to build a system that translates a given English sentence into French.

Here is an example of an input and predicted output sequence that the model will produce.

Prediction Example

Import Packages

Let’s first import the required python packages using the code below.
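
The exact import cell lives in the notebook ¹⁰; a minimal set that the sketches in the rest of this section assume would look roughly like this:

import re
import unicodedata

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split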

LanguageIndex Class

Next, we will create the LanguageIndex class that performs word-to-index mapping (e.g., “dad” -> 5) and vice versa. The goal is to store all the words of a particular language in a dictionary and to be able to reference their indices. This class also stores the length of the longest sentence for each language.
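
The full class is in the notebook ¹⁰. As a rough sketch of what LanguageIndex does (the actual field and method names may differ), it could look like this:

class LanguageIndex():
    """Maps each word of a language to an index (e.g., "dad" -> 5) and back."""

    def __init__(self, phrases):
        self.word2idx = {'<pad>': 0}     # reserve index 0 for padding
        self.vocab = set()
        self.max_length = 0              # length of the longest sentence seen

        for phrase in phrases:
            words = phrase.split(' ')
            self.vocab.update(words)
            self.max_length = max(self.max_length, len(words))

        # Assign the remaining indices in sorted vocabulary order.
        for i, word in enumerate(sorted(self.vocab)):
            self.word2idx[word] = i + 1
        self.idx2word = {i: w for w, i in self.word2idx.items()}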

Text Cleaning

Next, we will create some helper functions to generate data sequences for encoding and decoding. These helpers perform the feature engineering: they clean sentences by removing punctuation, extra spaces, and uncommon characters, and they convert each sentence into a list of indices, where each index identifies a word in the sentence.
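
A simplified sketch of these helpers is shown below; the versions in the notebook ¹⁰ handle more edge cases, and the regular expressions here are illustrative.

import re
import unicodedata

def unicode_to_ascii(s):
    # Strip accents so uncommon characters do not blow up the vocabulary.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def preprocess_sentence(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([?.!,¿])", r" \1 ", s)       # pad punctuation with spaces
    s = re.sub(r"[^a-zA-Z?.!,¿]+", " ", s)     # drop everything except letters and basic punctuation
    s = re.sub(r"\s+", " ", s).strip()         # collapse repeated whitespace
    # Start/end tokens tell the decoder where a sentence begins and ends.
    return '<start> ' + s + ' <end>'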

Load Dataset + Loss Function

Next, we define the function that transforms the data and loads it into a dataset. Also included in this snippet is the loss function.
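
The notebook ¹⁰ contains the exact definitions; a representative loss function for this kind of padded sequence model (our sketch, not a verbatim copy of the notebook cell) masks out the padding positions before averaging the cross-entropy:

import tensorflow as tf

def loss_function(real, pred):
    # Ignore positions where the target is the padding index 0.
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss = tf.keras.losses.sparse_categorical_crossentropy(real, pred, from_logits=True)
    mask = tf.cast(mask, dtype=loss.dtype)
    return tf.reduce_mean(loss * mask)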

Creating the Dataset

Now we can put all of this together: clean the input data, vectorize the input and output sentences, calculate the maximum sentence length for each language, and add the necessary padding. This is done with the following code.
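
A condensed sketch of that pipeline, building on the LanguageIndex and preprocessing sketches above (the names and shapes here are illustrative), might look like:

def create_dataset(pairs, batch_size=64):
    # pairs: list of (english_sentence, french_sentence) strings, already cleaned
    # with preprocess_sentence so that they include <start>/<end> tokens.
    inp_lang = LanguageIndex([en for en, fr in pairs])
    targ_lang = LanguageIndex([fr for en, fr in pairs])

    # Vectorize: each sentence becomes a list of word indices.
    input_tensor = [[inp_lang.word2idx[w] for w in en.split(' ')] for en, fr in pairs]
    target_tensor = [[targ_lang.word2idx[w] for w in fr.split(' ')] for en, fr in pairs]

    # Pad every sequence to the length of the longest sentence in its language.
    input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor, padding='post')
    target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, padding='post')

    dataset = tf.data.Dataset.from_tensor_slices((input_tensor, target_tensor))
    dataset = dataset.shuffle(len(pairs)).batch(batch_size, drop_remainder=True)
    return dataset, inp_lang, targ_lang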

Encoder + Decoder

  1. Encoder
  • The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called the context vector.

2. Decoder

  • The decoder is responsible for stepping through the output time steps while reading from the context vector.

In the next cell, we define the Encoder and Decoder architectures. By default, we configure the encoder and decoder to run on the CPU. However, training runs significantly faster on a CUDA-enabled GPU. To train on a GPU, replace lines 7 and 27 with the following.

self.gru = tf.compat.v1.keras.layers.CuDNNGRU(units,
                                              return_sequences=True,
                                              return_state=True,
                                              recurrent_initializer='glorot_uniform')
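
The full Encoder and Decoder classes are in the notebook ¹⁰. For reference, the Bahdanau attention step used inside the decoder typically looks something like the following sketch (the layer and variable names here are ours and may not match the notebook exactly):

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # applied to the decoder hidden state
        self.W2 = tf.keras.layers.Dense(units)   # applied to every encoder hidden state
        self.V = tf.keras.layers.Dense(1)        # reduces each tanh output to a scalar score

    def call(self, query, values):
        # query: previous decoder hidden state, shape (batch, hidden)
        # values: encoder outputs, shape (batch, max_len, hidden)
        query_with_time_axis = tf.expand_dims(query, 1)

        # Additive alignment scores: v_a^T tanh(W1*s + W2*h), shape (batch, max_len, 1).
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))

        # Softmax over the time axis turns the scores into attention weights.
        attention_weights = tf.nn.softmax(score, axis=1)

        # Context vector: weighted sum of the encoder outputs.
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights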

Creating the Model, Data, and Training!

Finally, we have all the pieces for a sequence-to-sequence translation model. Simply run the following snippet to instantiate the dataset and train the model.
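
If you are adapting the code rather than running the notebook ¹⁰ as-is, the heart of that snippet is a teacher-forced training step of roughly the following shape. This is a sketch: encoder, decoder, targ_lang, and loss_function refer to the objects defined in the cells above, and the exact signatures in the notebook may differ.

optimizer = tf.keras.optimizers.Adam()

def train_step(inp, targ, enc_hidden):
    loss = 0.0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        # Every target sequence starts with the <start> token.
        dec_input = tf.expand_dims([targ_lang.word2idx['<start>']] * int(targ.shape[0]), 1)
        # Teacher forcing: feed the true previous word at each time step.
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss / int(targ.shape[1])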

Closing Thoughts

With the Attention Mechanism, we are one step closer to mimicking human observation using machine learning. Feel free to download the source code for this post from this Google Colab Notebook. If you want to learn more about the Attention Mechanism, check out some of the resources below.

References

[1] Get meaning from text with language model BERT (September 1st, 2020) https://www.youtube.com/watch?v=-9vVhYEXeyQ&t=145s

[2] Neural Machine Translation by Jointly Learning to Align and Translate (May 19th, 2016) https://arxiv.org/abs/1409.0473

[3] Attention Is All You Need (December 6th, 2017) https://arxiv.org/abs/1706.03762

[4] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (April 19th, 2016) https://arxiv.org/abs/1502.03044

[5] What exactly are keys, queries, and values in attention mechanisms? (August 1st, 2019) https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms

[6] Machine Learning Crash Course: Embeddings (October 2nd, 2020) https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture

[7] A Comprehensive Guide to the Attention Mechanism (November 20th, 2019) https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/

[8] Deep Learning: Attention Mechanism (September 15th, 2019) https://blog.floydhub.com/attention-mechanism/

[9] Attention Mechanism an Intuitive Understanding (March 20th, 2019) https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f

[10] Blog Post Code (Feb 11th, 2022) https://colab.research.google.com/drive/1HRuVWssDYPdq1eDQSYL54OjLKXQlmW0_?usp=sharing

[11] Effective Approaches to Attention-based Neural Machine Translation (September 15th, 2015) https://arxiv.org/abs/1508.04025
