Attention Networks: A simple way to understand Self-Attention

Geetansh Kalra
8 min read · Jun 5, 2022


“Every once in a while, a revolutionary product comes along that changes everything.” — Steve Jobs

Since the paper Attention Is All You Need came out in 2017, attention networks have been used widely in many fields, mainly in NLP and computer vision, as people came to understand how useful they could be. But for me, the paper alone wasn't enough to understand them clearly.


I had to go through many articles and videos to understand how they work. So in this article, or maybe a series of articles (just 2 or 3, PROMISE!), I will try to write down all my learnings, and I hope you won't have to go through as many sources to understand attention networks and how they work.

Introduction to Attention Mechanism

Let's start with an example. If I asked you which TV series the line "A Lannister Always Pays His Debts" belongs to, your mind wouldn't pay equal attention to every word in the sentence. In fact, if you have watched the series, you would know that the word "Lannister" alone is enough to tell you that the series is Game of Thrones and that the line was said by Tyrion Lannister.


So the attention mechanism is based on the common-sense intuition that, when processing a large amount of information, we "attend to" only certain parts of it.

Why was the Attention Mechanism needed in the first place?

The attention mechanism gives transformers an extremely long-term memory: a transformer model can "attend" or "focus" on all of the tokens that have been generated so far.

Source: Illustrated Guide to Transformers- Step by Step Explanation by Michael Phi

Recurrent neural networks (RNNs) are also capable of looking at previous inputs, but they have a short reference window: as the story gets longer, an RNN can no longer access words generated early in the sequence. The same holds for Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks; they have a longer reference window than plain RNNs, but it is still limited. The attention mechanism, in theory and given enough compute resources, has an infinite reference window, and is therefore capable of using the entire context of the story while generating text.

Source: Illustrated Guide to Transformers- Step by Step Explanation by Michael Phi

What is Self-Attention?

Self-attention is attention that takes into consideration the relationships among the words within the same sentence. To understand it in more detail, we need to answer three questions:

1: What are QUERY, KEY & VALUE?

2: What is Positional Encoding?

3: What do we pass to QUERY, KEY & VALUE?

So, let's dig into the encoder model that the paper "Attention Is All You Need" introduces, and answer all three of them.

What are QUERY, KEY and VALUE?

Source: Visual Guide to Transformer Neural Networks — (Episode 2) Multi-Head & Self-Attention

The three linear layers you see in the image above take three inputs: QUERY, KEY & VALUE. Let's use an example to understand them too.

So if you were to search for something on YouTube or Google, the text you type in the search box is the QUERY. The video or article titles that come up are the KEYs, and the content inside them is the VALUEs. To find the best matches, the Query has to be compared against the Keys to see how similar they are.

To compute the similarity between the Query and the Key, we use cosine similarity. It is a great way to measure how similar two vectors are: it ranges from +1 (most similar) to -1 (most dissimilar). For two vectors A and B it is defined as:

similarity(A, B) = (A . B) / (|A| |B|)

We can rewrite this equation for matrices. Since we are now multiplying two matrices, we transpose the B matrix so the shapes line up:

similarity(A, B) = (A . B^T) / (|A| |B|)

And since what we really want is the similarity between the Query and Key matrices, we can finally write our equation as:

similarity(Q, K) = Q . K^T

(we will handle the scaling part a bit later).
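To make this concrete, here is a small NumPy sketch (illustrative toy values of my own, not from the paper) that computes the cosine similarity of two vectors and then the raw dot-product similarity between a toy Query and Key matrix:

import numpy as np

# Two word vectors (toy values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

# Cosine similarity: dot product divided by the product of the norms
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)          # a value between -1 and +1

# For matrices, the same idea becomes Q @ K.T (the transpose makes the shapes line up)
Q = np.random.rand(4, 8)   # 4 tokens, 8-dimensional embeddings (toy sizes)
K = np.random.rand(4, 8)
similarity = Q @ K.T       # (4 x 4) matrix of raw similarity scores
print(similarity.shape)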

Now that we understand the concept of similarity and what QUERY, KEY & VALUE are, let's move on to the second question: what is Positional Encoding?

What is Positional Encoding?


As we know, when dealing with textual data we need to convert it into numbers before feeding it into any machine learning model, including neural networks. The embedding layer converts each word into a fixed-length vector of a defined size. The resulting vector is dense, with real values instead of just 0s and 1s, and its fixed length lets us represent words compactly with reduced dimensions. In this way the embedding layer works as a lookup table: the words are the keys in this table, while the dense word vectors are the values.
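As a quick, hedged illustration (PyTorch, with a made-up toy vocabulary), an embedding layer really is just a trainable lookup table indexed by token ids:

import torch
import torch.nn as nn

vocab = {"hi": 0, "how": 1, "are": 2, "you": 3}   # hypothetical toy vocabulary
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)  # 8-dim vectors

token_ids = torch.tensor([vocab[w] for w in ["hi", "how", "are", "you"]])
vectors = embedding(token_ids)   # shape (4, 8): one dense vector per word
print(vectors.shape)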

The main reason we need positional encoding is that, unlike an LSTM, which consumes one input embedding at a time sequentially, a transformer takes all the embeddings at once. This makes the transformer much faster, but it loses the information about word order. To solve this problem, the authors of "Attention Is All You Need" came up with a clever idea: they used wave frequencies to capture positional information.

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))

Source: the "Attention Is All You Need" paper

Let's not get into too much detail about this right now; we can cover it later in the series. If you want to read about it now, you can refer to this answer from Stack Exchange. For now, let's just note that positional encoding gives each word embedding a unique position signal.
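If you are curious anyway, here is a compact sketch of the sinusoidal encoding described by the formulas above (NumPy, toy sizes of my own choosing); feel free to skip it and come back later:

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                         # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions use sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions use cos
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
print(pe.shape)   # (4, 8), added element-wise to the word embeddings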

What do we pass to QUERY, KEY and VALUE?

To the QUERY layer we feed our position-aware embeddings. We then make two more copies of those embeddings and feed them to the KEY and VALUE layers as well. This may not seem to make any sense at first: why are we feeding the same embeddings to all three layers? Well, this is exactly where SELF-attention comes into the picture.
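A minimal sketch of this step (PyTorch, toy sizes, not the exact shapes from the article's example): the same position-aware embedding matrix is passed through three separate linear layers whose weights are learned independently:

import torch
import torch.nn as nn

d_model = 8
x = torch.rand(4, d_model)           # 4 tokens, each an 8-dim position-aware embedding

to_query = nn.Linear(d_model, d_model)
to_key   = nn.Linear(d_model, d_model)
to_value = nn.Linear(d_model, d_model)

# Same input, three different learned projections
Q, K, V = to_query(x), to_key(x), to_value(x)
print(Q.shape, K.shape, V.shape)     # each (4, 8)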


So, let's take an example again. Let's say you input the sentence "Hi, How are you?" and you want your transformer to output "I am fine".

In order to do this, we pass the input sequence to our input embedding layer and then apply the positional encoding. Then we pass these position-aware embeddings to our linear layers.

Please note that before passing the position-aware embeddings to our linear layers we transpose them, making the matrix of size (6x5). We do the same for all three QUERY, KEY & VALUE linear layers.

Now let's focus on the QUERY and KEY matrices; if you remember, we used them to find similarities. The outputs of the QUERY and KEY linear layers go to the matrix multiplication step in the network.

The output of this dot product can be called an Attention filter.

Let's focus on the attention filter for a moment. At the start, the weights in the attention filter are more or less random numbers, but once the training process is done they take on much more meaningful values and become the attention scores. Next, we scale these attention scores.

The authors of the "Attention Is All You Need" paper divide the attention scores by the square root of the dimension of the key vector, which in our case is 6.

Finally, we squash the scaled attention scores into values between 0 and 1 by applying the softmax function, so that each row of the filter sums to 1.

And then we get our final Attention Filter.

The last main step in this network is to multiply the attention filter which we created with the value matrix that we left out at the start.
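Here is a minimal sketch of these last few steps together (PyTorch, toy tensors and sizes of my own choosing, not the exact shapes from the example above): the raw scores, the scaling, the softmax, and the multiplication with the VALUE matrix:

import torch
import torch.nn.functional as F

Q = torch.rand(4, 8)    # toy query, key and value matrices: 4 tokens, 8-dim each
K = torch.rand(4, 8)
V = torch.rand(4, 8)

scores = Q @ K.transpose(-2, -1)              # (4, 4): every token scored against every token
scaled = scores / (K.shape[-1] ** 0.5)        # divide by sqrt(d_k) to keep scores stable
attention_filter = F.softmax(scaled, dim=-1)  # squash each row into weights that sum to 1
output = attention_filter @ V                 # re-weight the value vectors -> (4, 8)
print(output.shape)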

Now, you might still be wondering: why on earth did we go through all of this? What is the purpose of this attention filter? Let's take a little detour and understand it with a computer vision example.

If we were working on an image, then when we multiplied the attention filter with the value matrix, all the unnecessary information would have been thrown away.

Source: Visual Guide to Transformer Neural Networks — (Episode 2) Multi-Head & Self-Attention

I hope it is now clear why attention filters are necessary.

Finally, we pass the result of the attention filter x VALUE multiplication to a linear layer to get the desired output shape.

Source: Attention is all you need
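To tie everything together, here is a sketch of a single self-attention "head" as one small PyTorch module, including that final linear layer. It is illustrative only (single head, toy sizes) and leaves out the multi-head machinery the paper adds on top:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.to_query = nn.Linear(d_model, d_model)
        self.to_key = nn.Linear(d_model, d_model)
        self.to_value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)   # final linear layer for the output shape

    def forward(self, x):                        # x: (seq_len, d_model) position-aware embeddings
        Q, K, V = self.to_query(x), self.to_key(x), self.to_value(x)
        scores = Q @ K.transpose(-2, -1) / (K.shape[-1] ** 0.5)
        attention_filter = F.softmax(scores, dim=-1)
        return self.out(attention_filter @ V)

x = torch.rand(4, 8)                             # 4 tokens, 8-dim embeddings (toy values)
print(SelfAttention(d_model=8)(x).shape)         # torch.Size([4, 8])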

So, that was it. This is what self-attention means and how we perform it, step by step. Thanks to this mechanism, both the NLP and computer vision fields have benefited and achieved great results.

In the next article, I will be covering Multi-head attention, Cross-Attention, and Masked Attention.

Hope this helps!!

References:

1: Visual Guide to Transformer Neural Networks — (Episode 2) Multi-Head & Self-Attention: https://www.youtube.com/watch?v=mMa2PmYJlCo&t=309s

2: Illustrated Guide to Transformers- Step-by-Step Explanation: https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0

3: Attention Is All You Need: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

4: Introduction to Deep Learning: Attention Mechanism: https://www.youtube.com/watch?v=d25rAmk0NVk

5: The Attention Mechanism from Scratch: https://machinelearningmastery.com/the-attention-mechanism-from-scratch/

6: A Comprehensive Guide to Attention Mechanism in Deep Learning for Everyone: https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/
