Understanding Self Attention in Transformers

Sachinsoni
17 min read · Apr 13, 2024


Welcome to a journey into the heart of AI innovation! In this blog, we’ll explore the game-changing concept of self-attention. Imagine AI that not only understands data but also learns to prioritize what’s important all by itself. That’s the power of self-attention. Join us as we uncover how this simple yet revolutionary idea is transforming the way machines perceive and interpret information, unlocking a new era of intelligent computing. Let’s dive in and discover the magic of self-attention together!

Before we dive into the world of self-attention, I want to extend my heartfelt gratitude to my mentor, Nitish Sir. His exceptional guidance and teachings on the CampusX YouTube channel have been instrumental in shaping my understanding of this complex topic. With his support, I’ve embarked on this journey of exploration and learning, and I’m excited to share my insights with you all. Thank you, Nitish Sir, for being an inspiration and mentor!

What is Self Attention?

Before delving into self-attention, it’s essential to grasp the concept of word embedding.

Word Embedding:

In natural language processing (NLP), we transform words into numerical representations, a process known as vectorization. Initially, methods like one-hot encoding were used, followed by bag-of-words and TF-IDF approaches. However, the advent of word embeddings has revolutionized this field.

Word embeddings are dense vector representations of words in a low-dimensional vector space, enabling words with similar meanings or contexts to be closer together in that space. In simpler terms, word embeddings are a way to represent words as vectors (arrays of numbers) so that their relationships and meanings can be mathematically analyzed.

These embeddings are learned during the training process of a machine learning model, such as a neural network, to capture semantic and syntactic relationships between words. By representing words as dense vectors, word embeddings help capture the context and meaning of words, making it easier for models to understand and process text data.

To further illustrate the concept of word embeddings, consider a simple example where words are passed into a neural network, and the network learns to represent each word as a 5-dimensional vector with random initial values.

For instance, let’s say our neural network assigns the following initial vectors to three words: “king”, “queen”, and “cricketer”.

  • King: [0.1, 0.4, 0.8, 0.2, 0.9]
  • Queen: [0.3, 0.5, 0.6, 0.7, 0.8]
  • Cricketer: [0.9, 0.2, 0.1, 0.7, 0.5]

As the neural network gets trained on a large corpus of text data, it adjusts these vectors to better represent the words’ meanings and contexts. Through this learning process, the vectors of words with similar meanings or contexts become closer to each other in the 5-dimensional vector space, while those with dissimilar meanings or contexts move farther apart.

In this example, the final vectors learned by the neural network might look something like this:

  • King: [0.2, 0.3, 0.9, 0.5, 0.1]
  • Queen: [0.2, 0.4, 0.9, 0.5, 0.1]
  • Cricketer: [0.9, 0.4, 0.2, 0.1, 0.8]

As we can see, the vectors for “king” and “queen” are now more similar, representing their semantic and contextual relationships (both being royalty). On the other hand, the vector for “cricketer” remains distinct, highlighting its unrelated meaning compared to “king” and “queen”.

Since the attributes of “king” and “queen” share similarities, their vector representations would be closer together in a graphical representation, indicating a smaller angle between them. Conversely, the attributes of “cricketer” are distinct from those of royalty, resulting in a larger angle between the “cricketer” vector and both “king” and “queen” vectors.

This geometric arrangement reflects how word embeddings capture semantic relationships, with related words clustered closer together and unrelated words situated farther apart.
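
To make this concrete, here is a minimal NumPy sketch that measures the angle-based closeness (cosine similarity) between the illustrative vectors above; "king" and "queen" come out nearly identical, while "cricketer" sits much farther away:

```python
import numpy as np

# Illustrative "learned" embeddings from the example above
king      = np.array([0.2, 0.3, 0.9, 0.5, 0.1])
queen     = np.array([0.2, 0.4, 0.9, 0.5, 0.1])
cricketer = np.array([0.9, 0.4, 0.2, 0.1, 0.8])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means they point the same way
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(king, queen))      # ~0.996 (very small angle)
print(cosine_similarity(king, cricketer))  # ~0.43  (much larger angle)
```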

Problem with Word Embedding (The Problem of “Average Meaning”):

Let’s take an example to explain this (see the diagram below):

We have four sentences, and we’re using a two-dimensional embedding for the word “apple.” In this embedding, the first dimension represents the taste of the apple, while the second dimension represents its association with technology. When we input each sentence into the neural network to generate the word embedding for “apple,” the resulting outputs are represented in the following diagram.

Consider the diagram above, where the embedding vector for “apple” changes with each sentence. Now, suppose our dataset consists of 10,000 sentences, with 9,000 referring to “apple” as a fruit and 1,000 mentioning it as a technology. As a result, the overall embedding vector for “apple” is [0.9, 0.3]. This indicates that the average value of the apple embedding is skewed towards the taste dimension, reflecting the prevalence of sentences where “apple” is associated with its fruity context.
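
To see roughly how this averaging plays out, here is a small sketch with hypothetical per-context vectors for "apple" (the exact numbers are made up, chosen only so the frequency-weighted average lands near the [0.9, 0.3] mentioned above):

```python
import numpy as np

# Hypothetical per-context targets for the word "apple": [taste, technology]
fruit_context = np.array([1.0, 0.2])
tech_context  = np.array([0.2, 0.9])

n_fruit, n_tech = 9_000, 1_000

# A single static embedding ends up as a frequency-weighted average of its contexts
static_apple = (n_fruit * fruit_context + n_tech * tech_context) / (n_fruit + n_tech)
print(static_apple)  # ~[0.92, 0.27], heavily skewed toward the taste dimension
```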

The issue lies in the static nature of word embeddings. These embeddings are created once and remain unchanged, regardless of the varying contexts in which words are used across different datasets or tasks. This static representation fails to capture the dynamic nature of language and may not accurately reflect the nuances and variations in word meanings across different contexts.

Consider the scenario of building an English to Hindi translator.

Initially, we generate word embeddings for each word in the input sentence. For instance, if the embedding for “apple” is [0.9, 0.3], it suggests a strong association with the fruit due to its higher value in the taste dimension. However, in the given sentence where “apple” refers to a phone, this static embedding fails to capture the context accurately.

To address this limitation, we need contextual embeddings that adapt based on the surrounding words in the sentence. Self-attention offers a solution by dynamically adjusting embeddings according to the context in which words are used. In the example sentence, when words like “launched” and “phone” appear, the embedding for “apple” would adjust, reducing its association with taste and increasing its association with technology. This contextual awareness prevents confusion, even when words like “orange” are present.

By passing word embeddings through self-attention mechanisms, we obtain smart contextual embeddings that accurately reflect the meaning of words in their specific contexts, improving the performance of NLP tasks like translation.

From the above discussion, it’s evident that self-attention serves as a mechanism for transforming static embeddings into dynamic contextual embeddings. Below, I’ve crafted a diagram illustrating this process. Beginning with the sentence ‘Humans love smartphones,’ each word undergoes static embedding generation. These embeddings are then fed into a self-attention block, where dynamic contextual embeddings are produced by capturing inter-word dependencies.

How does self attention work?

In the figure, we observe two sentences where the word ‘bank’ is used in distinct contexts: one related to finance (‘bank’ as in a financial institution) and the other related to geography (‘bank’ as in the side of a river). To effectively represent the word ‘bank’ in these differing contexts, we can employ separate equations, as illustrated in the figure.

The equations presented above lead us to the conclusion that the word ‘bank’ is employed in two distinct contexts within the given sentences. In the first sentence, ‘bank’ is associated with financial transactions, depending 70% on its own contextual meaning (‘bank’ as a financial institution), 20% on ‘money,’ and 10% on ‘grows.’ Conversely, in the second sentence, ‘bank’ refers to the side of a river, depending 40% on its own contextual meaning (‘bank’ as the side of a river), 50% on ‘river,’ and 10% on ‘flows.’ Note that the weights in each equation sum to 1. Similarly, we can write analogous equations for the other words, as shown in the figure:

Machines do not inherently understand words, so we transform them into their corresponding n-dimensional embeddings. This process is depicted in the diagram below, where each word in the sentences is represented by a vector.

An important aspect to clarify is the significance of the numbers depicted in the diagram. These numbers represent similarity scores between the embedding vectors of words. For instance, in the context of the word ‘money,’ the similarity score is 0.7 with itself (‘money’), 0.2 with ‘bank,’ and 0.1 with ‘grows.’

These numbers indicate how closely related the meanings of different words are within the embedding space. To calculate these similarity scores, we employ the dot product between the embedding vectors, allowing us to quantify the semantic relationships between words in a high-dimensional space. The higher the dot product value, the greater the similarity between the vectors, and conversely, the lower the value, the less similarity between them. The equation provided for ‘bank’ represents its contextual representation using dot product, as illustrated in the figure :

Visual representation of the above equation:

In the process described, starting with the sentence ‘money bank grows,’ each word is first transformed into an embedding, depicted by the green blocks in the figure. Subsequently, we calculate the similarity scores between each word pair, resulting in sets of scores denoted by (s11, s12, s13), (s21, s22, s23), and (s31, s32, s33), illustrated by the pink blocks. These scores are then passed through a softmax function to normalize them within the range of 0 to 1, yielding sets of weights denoted by (w11, w12, w13), (w21, w22, w23), and (w31, w32, w33).

With these normalized weights obtained, we proceed to multiply them with the initial static embeddings of the words and sum the results. This operation yields the contextual embeddings, represented by y_money, y_bank, and y_grows, as depicted in the figure.
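
Here is a minimal NumPy sketch of this parameter-free form of self-attention for the sentence "money bank grows". The 4-dimensional embeddings are made up for illustration; what matters are the three steps: dot-product scores, softmax, and the weighted sum:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

# Made-up static embeddings for "money", "bank", "grows" (one row per word)
E = np.array([
    [0.9, 0.1, 0.8, 0.2],   # money
    [0.8, 0.2, 0.9, 0.1],   # bank
    [0.1, 0.9, 0.2, 0.7],   # grows
])

scores  = E @ E.T          # step 1: similarity of every word with every word (s11 ... s33)
weights = softmax(scores)  # step 2: normalize each row so it sums to 1 (w11 ... w33)
Y       = weights @ E      # step 3: weighted sum of the static embeddings

print(Y[1])  # y_bank: the new, context-aware embedding of "bank"
```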

Some important points about the above discussion:

The operations described above are performed in parallel, as illustrated in the figure below.

In the described process, parallel processing offers advantages, but it comes with a drawback: the loss of sequence information. However, we’ll explore how to address this issue because the benefits of parallel processing outweigh the loss of sequence information.

Notably, this approach involves no learnable parameters, meaning this part of the model cannot adapt to the data during training. For instance, consider an English-to-Hindi translation model. It takes an English sentence, generates embeddings, passes them through a self-attention block, and produces a general contextual embedding, independent of the specific translation task.

However, using such general-purpose contextual embeddings can lead to problems in task-specific contexts. For example, translating the phrase ‘a piece of cake’ into Hindi might yield the literal ‘केक का टुकड़ा,’ even though the phrase is idiomatic and actually means ‘बहुत आसान कार्य’ (a very easy task). To address this, we introduce weights and biases in the self-attention mechanism to generate task-specific embeddings, ensuring the model captures context more accurately.

In the diagram below, we observe that while we cannot introduce weights and biases directly in the softmax step, we can do so in steps 1 and 3, which involve the dot product (multiplication) operations.

Let’s take a close look at the diagram below to understand where and how we introduce weights and biases in the dot product operation.

After carefully analyzing the diagram, we made a fascinating discovery: each word’s embedding, such as ‘money,’ ‘bank,’ or ‘grows,’ serves not just one, but three distinct roles. Let me break it down for you.

Imagine we’re focusing on the word ‘money,’ aiming to generate its contextual embedding. Notice how its embedding appears in three different colors: green, pink, and blue. Similarly, when we look at ‘bank’ or ‘grows,’ their embeddings also manifest in these three forms.

So, what exactly are these roles? Let’s explore. In the first instance, where we’re extracting the contextual embedding for ‘money,’ its embedding acts as a query. It’s like asking other word embeddings, “How similar are you to me?” The same goes for ‘bank’ and ‘grows’ — their embeddings query the rest.

Now, let’s talk about the pink embeddings. They play the role of keys. When a query comes from a green embedding (say, ‘money’), each pink embedding (for ‘money,’ ‘bank,’ and ‘grows’) responds, indicating how similar it is to that query. It’s like a conversation where each word checks its relevance to the others.

And finally, when we calculate weights to form a weighted sum, the blue embeddings serve as values. They contribute to the overall understanding of the text, adding weight to specific words based on their importance.

You might wonder how we came up with these names. It’s actually rooted in simple computer science. Think of it like creating a dictionary in Python.

Each word (‘a,’ ‘b,’ ‘c’) has a value. When we query a word, we’re essentially asking for its value — just like in our language context.
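
In code, the analogy looks something like this (a toy dictionary, purely for illustration): a normal dictionary lookup matches the query against exactly one key, while self-attention performs a "soft" lookup in which every key matches the query to some degree and the values are mixed accordingly:

```python
# A plain Python dictionary: an exact, "hard" lookup
lookup = {"a": 10, "b": 20, "c": 30}
print(lookup["b"])  # the query "b" matches exactly one key and returns its value: 20

# Self-attention is a "soft" lookup: every key matches the query to some degree,
# and the result is a weighted mix of all the values.
weights = {"a": 0.1, "b": 0.8, "c": 0.1}          # e.g. softmax of query-key similarities
soft_result = sum(weights[k] * lookup[k] for k in lookup)
print(soft_result)  # 0.1*10 + 0.8*20 + 0.1*30 = 20.0
```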

In essence, for calculating contextual embeddings, each embedding takes on three roles: query, key, and value.

So each word embedding serves three roles: query, key, and value. The problem with the current approach is that the same vector is used for all three roles; it would be better to have separate vectors for the query, key, and value roles. This separation allows each vector to perform its specific role more effectively, leading to better organization and efficiency.

Let’s compare this to something more familiar. Imagine a person who wrote his life story in a book, mentioning everything about himself, even what kind of partner he’s looking for. Now imagine he goes to a website called jeevansaathi.com to find a partner. There he does three things: he creates a profile, searches for potential partners, and then matches with someone. Creating a profile is like showing who you are to others, searching is like looking for specific traits in a partner, and matching is when both people like each other’s profiles.

Now, think about what would happen if he uploaded his whole autobiography instead of making a simple profile. That would be like asking someone to read an entire book just to get to know him! It’s too much. Similarly, in our word model, using the same embedding for creating the profile, searching, and matching would be like using the wrong tool for the job. It’s like using a fork to eat soup: not very efficient. So, instead of using the same vector for every task, we tailor our strategy to each specific task. This is where the idea of three different embeddings comes in: one for creating the profile, one for searching, and one for matching. Each embedding is optimized for its respective task, allowing for more efficient and effective processing. That’s the essence of this idea.

Now a question arises: what information will the person extract from his autobiography to put in his profile, and how will he decide this? The answer comes from the data. Suppose he is an author who mostly writes about politics, and he mentions this in his profile. After a month he notices that the people being recommended to him are all very politically inclined, which is not what he wants. So he learns from the data, goes back, and removes the word “politics” from his profile, keeping only that he is an author. Soon he starts getting recommendations that are more neutral towards politics. In other words, what he puts in his profile is shaped by the responses the data gives him.

The same applies to the search query. Initially he restricts his search to Maharashtrian girls, then realises from the recommendations that he should also consider other states, so he relaxes that preference to accept recommendations from anywhere in India. He first asks for both working and non-working matches, then notices from the requests he receives that he connects better with working professionals, so he narrows the query again to girls working in the same profession.

In short, he keeps adjusting both his profile and his search query based on what happens with the data, and that is exactly the kind of work we have to do here.

Just like in the example above, we need to create three vectors (query, key, and value) for each word from its initial embedding, and the model figures out how to build them by making mistakes on the data and learning from them.

Building three new vectors (query, key, and value) from the embedding vector:

Since we already have the embedding vector for each word, our task is to generate three new vectors from it. We have two options for this: one is scaling, which only changes the magnitude but not the direction of the vector. The other option is a linear transformation, where we multiply the vector by a matrix of a certain dimension. We need three different matrices for this purpose: one for generating the query, one for the key, and one for the value. Initially, these matrices contain random values, but through learning from the data and adjusting these values using backpropagation to minimize loss, we refine them. Once the training is complete, we use these matrices to multiply with the embedding vectors, resulting in three new vectors.

One notable point is that the same three matrices, learned during training, are shared across all words when producing their query, key, and value vectors.

We can process each word simultaneously (in parallel) to create dynamic contextual embeddings for all of them. Take a look at the diagram below and try to gain insights from it.
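
To make this concrete, here is a minimal NumPy sketch (not the author’s diagram, just an illustration with random numbers) of how three shared matrices turn every word’s static embedding into query, key, and value vectors in one parallel step:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 4   # dimension of the static embeddings
d_k     = 3   # dimension of the query/key/value vectors (an arbitrary choice here)

# Static embeddings for "money", "bank", "grows" (one row per word), made up for illustration
E = rng.normal(size=(3, d_model))

# The same three matrices are shared by every word; in a real model they are
# initialised randomly and then refined by backpropagation
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = E @ W_q   # one query vector per word, computed in parallel
K = E @ W_k   # one key vector per word
V = E @ W_v   # one value vector per word
print(Q.shape, K.shape, V.shape)  # (3, 3) each
```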

Scaled Dot Product Attention:

In the seminal research paper “Attention Is All You Need,” the authors introduced the concept of scaled dot product attention. This attention mechanism computes the dot product between a query vector and a set of key vectors, divided by the square root of the dimensionality of the key vectors (√dk).

Here a question arises: why is the division by the square root of the dimensionality of the key vectors (√dk) performed in the scaled dot product attention mechanism?

Let me explain this as follows:

When we compute the dot products between query and key vectors, we obtain a matrix of scores. It’s noteworthy that when the dot products involve low-dimensional vectors, the resulting scores have low variance. Conversely, if we use higher-dimensional query and key vectors, the resulting scores exhibit much higher variance.

The issue lies in the behavior of the softmax function when applied to these scores. Softmax assigns almost all of the probability mass to the largest values and nearly none to the smaller ones. Consequently, during training, the gradients flowing through the smaller entries become negligible, so the model effectively learns only from the dominant values, leading to unstable training. This problem is particularly pronounced when the scores have high variance.
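
A quick sketch of both effects, using random vectors purely for illustration: the variance of query-key dot products grows with the dimension, and once the scores get large, softmax pushes almost all of the probability onto the single largest score:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 1) The variance of query-key dot products grows with the dimension d_k
for d_k in (4, 64, 512):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    scores = (q * k).sum(axis=1)
    print(d_k, scores.var())   # variance comes out roughly equal to d_k

# 2) Softmax over large scores is nearly one-hot
print(softmax(np.array([1.0, 2.0, 3.0])))    # fairly balanced: ~[0.09, 0.24, 0.67]
print(softmax(np.array([10.0, 20.0, 30.0]))) # almost all mass on the largest value
```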

Let me take an example to explain this : Imagine a classroom where students vary significantly in height. When the teacher asks the class to raise their hands to ask questions, taller students, with their hands higher up, are more likely to catch the teacher’s attention. However, because their hands are raised higher, they may unintentionally obscure the raised hands of shorter students sitting behind them. Consequently, the teacher might not notice the shorter students’ hands, and their questions may go unanswered.

This situation reflects the problem encountered with unscaled dot products: the dominance of larger values (taller students) overshadows the smaller values (shorter students), leading to an incomplete understanding of the overall class’s questions or concerns. Scaling the dot products is akin to adjusting the criteria for raising hands based on students’ heights, so that all students, regardless of their height, have an equal opportunity to ask questions and have their concerns addressed.

To mitigate this, scaled dot product attention divides the dot products by √dk, which brings the variance of the scores back down (roughly to 1 when the query and key components themselves have unit variance) and keeps the softmax from saturating.
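
Putting the pieces together, here is a compact NumPy sketch of scaled dot product attention as described in “Attention Is All You Need”: the scores are divided by √dk before the softmax, and the result is a weighted sum of the value vectors. The toy inputs are random and only illustrate the shapes involved:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores  = Q @ K.T / np.sqrt(d_k)   # scaled similarity scores
    weights = softmax(scores)          # one probability distribution per query
    return weights @ V                 # contextual embeddings, one per query

# Toy example: 3 words, d_k = 4 (random Q, K, V purely for illustration)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```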

Geometrical Intuition of Self Attention:

Imagine you have the phrase “money bank” and you want to represent it geometrically after generating its embedding using a simple word2vec model. Let’s say this embedding is a 2-dimensional vector. In a geometric representation, this vector would be like a point in a 2D space, where each dimension represents a different aspect or feature of the words “money” and “bank”. So, you can imagine plotting this point on a graph, where one axis represents one aspect of the word and the other axis represents another aspect.

Now, let’s generate query, key, and value vectors for each word using matrix multiplication. Suppose each vector is 2-dimensional. Their geometrical representation can be visualized as follows:

In the vector diagram below, observe that the angle between the “bank” query vector and the “money” key vector is greater than the angle between the “bank” query vector and the “bank” key vector. This indicates that the dot product is higher in the second case, implying greater similarity compared to the first case.

After applying the attention weights to the value vectors, we obtain the weighted value vectors, which are depicted in orange in the diagram below.

By applying the parallelogram law of vector addition, we arrive at the resultant vector y_bank, which closely aligns with the embedding vector for “money.” This dynamic and task-specific contextual vector encapsulates the essence of the relationship between “money” and “bank.”

So, this is the geometrical representation of self attention.

Why is self attention called “self”?

Self-attention is called “self” because it allows the model to focus on different parts of the input sequence by relating each element to other elements within the same sequence, rather than relying on external information. The “self” in self-attention refers to the mechanism’s ability to capture relationships and dependencies within the input sequence itself.

References:

Research paper: “Attention Is All You Need” (Vaswani et al., 2017)

YouTube video: https://youtu.be/-tCKPl_8Xb8?si=JG828MAlemg4JHAE

I trust this blog has enriched your understanding of self-attention. If you found value in this content, I invite you to stay connected for more insightful posts. Your time and interest are greatly appreciated. Thank you for reading!
