Self-Attention: A step-by-step guide to calculating the context vector

Lovelyn David
7 min read · Oct 16, 2023


Introduction

In recent years, self-attention has revolutionized the field of natural language processing (NLP). One of the most recent breakthroughs in NLP that was enabled by self-attention is the development of transformer models. Transformer models are now the state-of-the-art for many NLP tasks, including:

· Machine translation

· Question answering

· Text summarization

· Natural language inference

In this blog post, we will provide a high-level overview of self-attention and how it works. We will also include a walkthrough of how to calculate the attention scores for a given sentence.

Self-Attention — Definition

Self-attention is a technique for creating context-aware vector representations of words. These representations are a key factor in helping machines understand how different words in a sentence relate to each other. Self-attention operates on sentences represented as word embeddings, which are numerical vector representations of words, and transforms this sequence into a new representation while preserving the original word order, making it a fundamental component of powerful AI models like the Transformer.

Exploring the Magic of Self-Attention Scores

Self-attention works by calculating attention scores between pairs of words. Using the sentence “I am going to play”, let’s walk through how these scores are calculated. A score is computed for every (query word, key word) pair in the sentence, as enumerated below.
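For a five-word sentence there are 5 × 5 = 25 such pairs, since each word is compared with every word, including itself. A quick sketch in Python:

```python
words = ["I", "am", "going", "to", "play"]

# Every (query word, key word) combination receives an attention score
pairs = [(q, k) for q in words for k in words]
print(len(pairs))  # 25
```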

An explanation of this content can also be found in the following YouTube video.

https://youtu.be/a4L43vwNI_o

Calculating Self-Attention Scores

To calculate self-attention scores, we start with an input vector xt, the embedding of the token at position t, for which the score is calculated. If n is the sequence length, a self-attention score is calculated between xt and every token embedding xj, where j can take values from 1 to n, covering the entire sequence.

Step 1 : Creating Query (Q), Key (K), and Value (V) Vectors

The input to the self-attention mechanism consists of the embedding vectors of the input tokens. Self-attention relies on three special vectors for each token: a Query (Q), Key (K), and Value (V) vector. These vectors are derived from the embedding vector through learned linear projections.

We begin with the embedding vector and, during the training process, pass the input through three separate fully connected layers whose weights form the three matrices WQ, WK, and WV. A linear feedforward layer that maps the w nodes of the input vector to w’ nodes in the second layer corresponds to a w × w’ weight matrix, termed WQ, between the two layers. We are free to choose the value of w’.

These matrices are updated during the training process. In the original paper, the task employed for training was machine translation. These matrices help us obtain three crucial vectors:

qt = xt · WQ

kt = xt · WK

vt = xt · WV

where WQ, WK, and WV are w × w’ matrices and w’ is smaller than w. Each word embedding xt is a 1 × w row vector, so multiplying it by one of these matrices yields a new 1 × w’ latent vector. Every word in the input sequence therefore has an associated qt, kt, and vt vector.
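As a minimal sketch of this step in NumPy (the sizes w = 4 and w’ = 3 are illustrative assumptions, not values from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

n, w, w_prime = 5, 4, 3      # sequence length, embedding dim, projection dim (assumed)
X = rng.random((n, w))       # one embedding row vector xt per token

# The three projection matrices (random here; learned during training)
W_Q = rng.random((w, w_prime))
W_K = rng.random((w, w_prime))
W_V = rng.random((w, w_prime))

Q = X @ W_Q                  # stacked qt vectors, shape (n, w')
K = X @ W_K                  # stacked kt vectors, shape (n, w')
V = X @ W_V                  # stacked vt vectors, shape (n, w')
```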

Step 2 : Calculate Scaled-Dot Product Attention

This step is a fundamental component of self-attention and is often referred to as “scaled dot-product attention” because it scales the dot product by the square root of the key dimension w’ (typically denoted dk).

The scaled dot-product attention score between the query at position t and the key at position j can be generalized as:

score(qt, kj) = (qt · kj) / √dk

This scaling helps control the magnitude of the attention scores, making them more manageable and interpretable.
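Continuing the NumPy sketch from Step 1, the scaled scores of one query against every key take a single line:

```python
d_k = K.shape[1]                      # key dimension w'
t = 0                                 # query position (the first token)
scores = Q[t] @ K.T / np.sqrt(d_k)    # n scaled scores for position t
```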

Step 3 : Create a Vector of Probabilities

To normalize the attention scores and convert them into a probability distribution, we apply the softmax function:

αt,j = exp(score(qt, kj)) / Σ exp(score(qt, km)), with the sum running over m = 1 to n

This results in a vector of n probabilities, which are the attention weights between xt and every token in the sequence. At time step 0, for example, these weights are computed between the first token and all n tokens.
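In the running sketch, the softmax step looks like this:

```python
def softmax(x):
    e = np.exp(x - x.max())    # subtract the max for numerical stability
    return e / e.sum()

alpha = softmax(scores)        # n attention weights for position t; they sum to 1
```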

Step 4 : Compute the Context Vector

The context vector summarizes the information from all the words in a sequence with respect to a specific word or token at position ‘t’. This context vector captures the context or information from the entire sequence in a way that emphasizes the words that are most relevant to the word at position ‘t’.
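Written out, the context vector zt is the attention-weighted sum of the value vectors:

zt = Σ αt,j · vj, with the sum running over j = 1 to n

which in the running sketch is a single line:

```python
z_t = alpha @ V    # context vector for position t, shape (w',)
```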

Example

Input sequence: “I am going to play”

Query: “play”

Keys: “I”, “am”, “going”, “to”, “play”

Values: “I”, “am”, “going”, “to”, “play”

Attention scores: [0.2, 0.1, 0.2, 0.1, 0.4]

Context vector: [0.2 * “I” + 0.1 * “am” + 0.2 * “going” + 0.1 * “to” + 0.4 * “play”], where each word stands for its value vector.

The context vector for the query word “play” is a weighted sum of the value representations of all the words in the sequence, weighted by their attention scores. The scores indicate that the words most important to “play” are “play” itself, followed by “I” and “going”.

The context vector is a vital component of self-attention and is used to generate context-aware representations for each word in the sequence, allowing the model to understand and relate words to each other in a contextually meaningful way.

Step 5 : Generalizing Self-Attention for Scalability

We can generalize this process by stacking the vectors into matrices: Q = {q1, q2, … qn}, K = {k1, k2, … kn}, and V = {v1, v2, … vn}, with one row per token. This allows us to efficiently process sequences of different lengths and dimensions. The generalized equation for the Attention function is:

Attention(Q, K, V) = softmax(QKᵀ / √dk) · V
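A vectorized sketch of this matrix form, reusing Q, K, and V from the Step 1 snippet:

```python
def attention(Q, K, V):
    """Scaled dot-product attention over the whole sequence at once."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, w') matrix of context vectors

Z = attention(Q, K, V)    # the matrix Z discussed below
```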

It’s worth noting that when we use an encoder-decoder model, in the encoder, we calculate ‘Q,’ ‘K,’ and ‘V’ for the input sequence, while in the decoder, we compute them for the output sequence. This dynamic usage of ‘Q,’ ‘K,’ and ‘V’ is a fundamental concept in self-attention and a crucial part of the Transformer model.

Clarifying the Roles in Self-Attention

In the self-attention mechanism, there are three important roles:

Query (Q): Think of the query as a representation of a word at a specific time step, let’s say ‘t.’ It’s like a question that checks for compatibility with other words in the sequence.

Key (K): The key is the representation of the token against which we check the query’s compatibility. It’s like the answer to the question posed by the query.

Value (V): The value is the actual representation vector of the token. It’s like the meaningful information or content associated with a word.

You can view this process as querying the keys to validate the semantic relationships between tokens. The output matrix ‘Z’ (the stacked context vectors) embeds these relationships between query-key pairs in a latent space. This latent space can capture various kinds of relationships, such as gender or other semantic connections.

The Latent Space

The latent space in self-attention is a flexible concept that can capture various types of semantic relationships. For instance, in the context of gender, the self-attention mechanism can be used to understand and represent the semantic relationships between words like “man” and “woman.” This means that the mechanism is not limited to a single type of relationship but can adapt to different semantics and concepts, making it a powerful tool for understanding language and context.

It’s also important to note that in some cases, the key and value vectors may be the same, simplifying the self-attention process for certain problem statements.

Illustrating Self-Attention with a Concrete Example

Let’s break down the self-attention process step by step using the example of the sequence “Playing Outside” and a hypothetical vocabulary size of 128:

Here are the sample vectors we have assumed for both the words:

Playing

q1 = [0.212 0.04 0.63 0.36]

k1 = [0.31 0.84 0.963 0.57]

v1 = [0.36 0.83 0.1 0.38]

Outside

q2 = [0.1 0.14 0.86 0.77]

k2 = [0.45 0.94 0.73 0.58]

v2 = [0.31 0.36 0.19 0.72]
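Using these assumed vectors, the remaining steps can be carried out in a short script; the printed numbers follow from the sample values above rather than from any published result:

```python
import numpy as np

q = np.array([[0.212, 0.04, 0.63,  0.36],    # q1: "Playing"
              [0.1,   0.14, 0.86,  0.77]])   # q2: "Outside"
k = np.array([[0.31,  0.84, 0.963, 0.57],    # k1
              [0.45,  0.94, 0.73,  0.58]])   # k2
v = np.array([[0.36,  0.83, 0.1,   0.38],    # v1
              [0.31,  0.36, 0.19,  0.72]])   # v2

d_k = k.shape[1]                              # here dk = 4
scores = q @ k.T / np.sqrt(d_k)               # 2 x 2 scaled score matrix
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax
Z = weights @ v                               # one context vector per word
print(Z)
```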

The context vector can then be used for a variety of downstream tasks, such as machine translation, text summarization, and question answering.

We have delved into the core mechanism that drives the remarkable capabilities of Transformer models. Starting with a simple sentence, “Playing Outside,” and a hypothetical vocabulary size, we have unveiled the step-by-step process of self-attention, from tokenization through the creation of the query (q), key (k), and value (v) vectors to the final context vector.
