Self-Attention Mechanism In Transformers

Understanding the Core of Modern Language Models

Priyanthan Govindaraj
8 min read · Jul 19, 2024

Introduction

In this blog, we will delve deeper into the self-attention mechanism in transformers. Have you ever wondered how models like ChatGPT provide coherent and contextually relevant answers to very long prompts? The secret lies in the fascinating mechanism called self-attention.

Background

Before diving into self-attention, it is important to understand its context and usage. Self-attention is a key component of transformers, a type of neural network architecture. Transformers differ from traditional neural networks as they process sequences of inputs and outputs, making them highly effective for sequence-to-sequence tasks.

The evolution of sequence processing began with recurrent neural networks (RNNs), which were designed to handle sequential data but often suffered from vanishing gradient problems. To address these issues and maintain important information across long sequences, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks were developed. However, these models processed inputs sequentially, limiting their ability to capture relationships between distant tokens within the input sequence.


This is where transformer architecture revolutionized the field. Transformers can take multiple tokens as input simultaneously and return a sequence of tokens, effectively capturing the relationships between all tokens in the input. The self-attention mechanism is crucial in achieving this capability, allowing the model to weigh the importance of each token relative to others in the sequence.

What is self-attention?

In simple terms, self-attention layers find connections or similarities between input tokens. Although we input tokens, the model doesn’t understand any language or its meanings. It only understands numbers or vectors, so we need to convert the tokens into vectors.

How Do You Convert Tokens into Vectors, and How Does It Help?

In the context of language models, these vectors are called embeddings. Word embeddings give each word a vector representation that encodes its meaning, allowing the model to recognize words with similar meanings.
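
To make this concrete, here is a minimal sketch of the idea in NumPy. The tiny vocabulary, embedding size, and random values are invented for the example; a real model learns its embedding table during training.

```python
import numpy as np

# Toy vocabulary and embedding table. The words, the embedding size, and the
# random values are made up for illustration; a real model learns its embeddings.
vocab = {"a": 0, "good": 1, "novel": 2}
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

# Convert the tokens of "a good novel" into their embedding vectors.
tokens = ["a", "good", "novel"]
E = np.stack([embedding_table[vocab[t]] for t in tokens])
print(E.shape)  # (3, 4): one embedding vector per token
```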


What does self-attention do with these embeddings?

The primary goal of self-attention is to adjust the embeddings of each token based on its context within a sentence. Consider the word “bat,” which has different meanings depending on the context:

  • Bats are essential for controlling insect populations.
  • He swung the baseball bat with precision.

The embedding for “bat” needs to be modified according to its usage in each sentence. The transformation process is illustrated below, where E represents the input embedding and E′ represents the output embedding of the self-attention layer.

source: 3Blue1Brown YouTube channel

Although the concept might seem magical, it is fundamentally grounded in mathematics, particularly matrix multiplication. Self-attention involves determining the relationships and their weights between tokens. Each input token is associated with three key components:

  • Query(Q): Think of it as the embedding asking the rest of the embeddings in the input, “Is there anyone defining me or connected with me?”
  • Key(K): The key responds to these queries, indicating connections with responses like “Yes, I am” or “Yes, I am connected with you.”
  • Value(V): The value provides the actual data associated with the key. Essentially, the value represents the exact content that will be used to update the embeddings based on the relationships identified by the queries and keys.

Let’s delve into how these components contribute to the self-attention mechanism.

How does self-attention work?

The primary objective of self-attention is to update each token’s embedding according to its context. To achieve this, we follow several steps: computing the attention pattern, masking, normalization, and finally determining how much each word embedding should change.

How Are the Query, Key, and Value Vectors Computed?

  • Let’s take the phrase “a good novel” as an example. First, we tokenize the phrase, and then we generate word embeddings for each token, shown in blue.

What Are Wq, Wk, and Wv Matrices?

These are learned weight matrices. They are parameters of the transformer model, optimized during training.

Wq: Query weight matrix → Green color

Wk: Key weight matrix → Yellow color

Wv: Value weight matrix → Red color

These learned weight matrices are used to compute the Query, Key, and Value vectors: each input embedding is multiplied by each of them individually. Finally, we have three separate vectors for each token:

Query vector : Green color

Key vector : Yellow color

Value vector : Red color
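
Here is a minimal NumPy sketch of this step. The embeddings and the Wq, Wk, and Wv matrices are random placeholders standing in for the values a trained model would have learned:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 4, 4           # "a good novel" -> 3 tokens; toy sizes

E = rng.normal(size=(seq_len, d_model))   # input embeddings, one row per token

# Learned weight matrices (random placeholders here; trained in a real model).
Wq = rng.normal(size=(d_model, d_k))      # query weight matrix
Wk = rng.normal(size=(d_model, d_k))      # key weight matrix
Wv = rng.normal(size=(d_model, d_k))      # value weight matrix

# Each embedding is multiplied by each weight matrix, giving three vectors per token.
Q = E @ Wq   # query vectors
K = E @ Wk   # key vectors
V = E @ Wv   # value vectors
```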

How to Extract the Attention Pattern?

  • Now that we have all the required vectors, the next step is to take the dot product between each query vector and each key vector to get similarity scores.
  • The most similar query–key pairs produce higher scores, and the least similar pairs yield lower scores.
  • The scores are then scaled by dividing by the square root of dk, where dk is the dimension of the key vectors; this stabilizes the gradients during training (a small sketch of these steps follows below).
  • There is a problem with the calculation shown in the image: when the language model predicts the next token, it shouldn’t consider the future tokens, only the previous ones.
  • This is where masking comes in.

source: 3Blue1Brown YouTube channel
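
Before turning to masking, here is the scoring and scaling step sketched in NumPy, assuming Q and K hold the per-token query and key vectors as rows (random placeholders again):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 3, 4
Q = rng.normal(size=(seq_len, d_k))   # query vectors, one row per token
K = rng.normal(size=(seq_len, d_k))   # key vectors, one row per token

# Dot product of every query with every key, scaled by the square root of d_k.
scores = Q @ K.T / np.sqrt(d_k)       # shape: (seq_len, seq_len)
```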

How to do the masking?

To solve this, we prevent future tokens from contributing by replacing the corresponding entries of the score matrix (those below the diagonal in this layout) with −∞. This ensures that, during normalization, these values become 0 after applying the softmax function.

source: 3Blue1Brown YouTube channel
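
A minimal sketch of causal masking in NumPy. Note the convention: here the score matrix has queries as rows, so the future-token entries sit above the diagonal; in the video’s layout (queries as columns) the same entries appear below the diagonal, as described above.

```python
import numpy as np

seq_len = 3
scores = np.zeros((seq_len, seq_len))   # placeholder for the scaled scores

# True wherever the key position comes after the query position (a future token).
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Replace those entries with -inf so the softmax turns them into 0.
masked_scores = np.where(future, -np.inf, scores)
```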

How to Normalize the Values?

To normalize the values, we apply the softmax function so that the values in each column lie between 0 and 1 and sum to 1. The −∞ values become 0 after applying the softmax, and the result gives the attention weights. These weights indicate the importance of each token with respect to the current token.

source: 3Blue1Brown YouTube channel
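
A small sketch of this normalization step. With queries as rows (as in the earlier sketches), the softmax is applied along each row; the −∞ entries become exactly 0 and the remaining weights in each row sum to 1:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

masked_scores = np.array([
    [0.5, -np.inf, -np.inf],
    [0.2,  0.7,   -np.inf],
    [0.1,  0.3,    0.9   ],
])
attention_weights = softmax(masked_scores, axis=-1)
print(attention_weights.sum(axis=-1))   # each row sums to 1
```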

Determining the Change in Embedding

The main goal of self-attention is to update the token embeddings. The attention weights show the most relevant tokens for a particular token, identified as the weights in that token’s column.

The value vector of each token carries that token’s actual content. Adding another token’s value vector (scaled by its attention weight) to a token’s embedding produces a new embedding that blends both tokens’ attributes according to that weight.

We calculate this as a weighted sum by multiplying the attention pattern by the value vectors. This process mixes other tokens’ information into each embedding according to its weight.

source: 3Blue1Brown YouTube channel

The final output gives the updated embeddings of each token.

source: 3Blue1Brown YouTube channel
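
Here is the weighted-sum step as a minimal NumPy sketch, with toy attention weights and value vectors (queries as rows, so each output row is a weighted sum of the value vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_v = 3, 4

# Toy attention weights (rows already normalized to sum to 1) and value vectors.
attention_weights = rng.random((seq_len, seq_len))
attention_weights /= attention_weights.sum(axis=-1, keepdims=True)
V = rng.normal(size=(seq_len, d_v))   # value vectors, one row per token

# Each output row mixes the value vectors according to the attention weights,
# adding other tokens' information into the token's representation.
output = attention_weights @ V        # shape: (seq_len, d_v)
```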

Now let’s see how the entire process is expressed as a single function in the Attention Is All You Need paper.

  • Computing the attention scores from the dot product of Q and K.
  • Scaling the scores by dividing by the square root of dk.
  • Applying the softmax to normalize them.
  • Finally, calculating the weighted sum by multiplying by the value matrix (V).
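
In equation form, the paper writes the whole process as:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V

where Q, K, and V are the matrices of query, key, and value vectors, and dk is the dimension of the key vectors.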

The entire process we have covered so far is Scaled Dot-Product Attention, also known as a single head of attention.

source: Attention is All You Need paper

A single attention head captures only one view of the relationships between tokens, so it may miss other aspects of the sequence and make less effective use of the model’s capacity.

To resolve this, we use multi-head attention, which runs several attention heads in parallel. Each head works on a smaller slice of the embedding dimension (the head size) and has its own learned weight matrices (Wq, Wk, and Wv). This enables each head to focus on different aspects of the relationships between input words.

source: Attention is All You Need paper

After each head is processed individually, the updated embedding for each token is obtained by combining the contributions of all the heads.

source: 3Blue1Brown YouTube channel
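
Here is a compact NumPy sketch of the idea, with random placeholders for the learned weights. This version combines the heads the way the paper does, by concatenating their outputs and applying an output projection Wo (which plays the same role as summing each head’s contribution to the embedding); the causal mask is omitted to keep the sketch short:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
head_dim = d_model // num_heads                 # each head works in a smaller subspace

E = rng.normal(size=(seq_len, d_model))         # input embeddings, one row per token

head_outputs = []
for _ in range(num_heads):
    # Each head has its own learned Wq, Wk, Wv (random placeholders here).
    Wq, Wk, Wv = (rng.normal(size=(d_model, head_dim)) for _ in range(3))
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(head_dim)        # scaled dot-product scores
    weights = softmax(scores, axis=-1)          # attention pattern for this head
    head_outputs.append(weights @ V)            # this head's weighted sum of values

# Concatenate the heads and apply the output projection (random placeholder Wo).
Wo = rng.normal(size=(d_model, d_model))
multi_head_output = np.concatenate(head_outputs, axis=-1) @ Wo   # (seq_len, d_model)
```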

Now let’s look at an example. Consider the input “Attention is a novel idea.” After the tokenization step, the tokens will be [CLS], attention, is, a, novel, idea, and [SEP]. Let’s see how the single-head and multi-head attention layers find relationships between these tokens.

01. Single-head attention

source: udemy

Looking more closely, the image below shows the connections between “attention” and the other tokens in single-head attention.

source: udemy

02. Multi-head attention: 6 heads

source: udemy

Looking more closely, the image below shows the connections between “attention” and the other tokens in multi-head attention.

source: udemy

Now, I hope you have a clear idea of how this self-attention works.

What are the Advantages of this Self-Attention Mechanism?

  • Capturing long-range relationships:

RNNs, LSTMs, and GRUs struggle to capture long-range relationships because of the vanishing gradient problem and their sequential processing. Self-attention, in contrast, processes all tokens simultaneously, so it captures long-range relationships efficiently.

  • Flexible context length:

Self-Attention dynamically adjusts the context size based on the attention weights. This allows the model to focus on the most relevant parts of the sequence, enhancing its ability to understand and generate contextually appropriate responses.

Conclusion

The self-attention mechanism is a cornerstone of modern language models like transformers, enabling them to process and understand complex sequences of text effectively. By capturing long-range relationships and dynamically adjusting the context, self-attention allows these models to generate coherent and contextually relevant responses. Understanding this mechanism not only provides insight into how models like ChatGPT work but also opens up possibilities for further advances in natural language processing.

Thank You
