Self-Attention: The Magic Behind Transformers and the AI Revolution

Aman Anand
7 min read · Sep 19, 2024

Today, we’re diving into something that’s at the core of Transformers, and Transformers themselves are at the core of this AI revolution! Meet self-attention — the secret sauce that lets models like GPT understand the relationships between words, no matter where they appear in a sentence. It’s like giving AI superpowers to focus on the important stuff, making sense of everything from language translation to creative text generation.

If I asked you, “What’s the most important task in all of natural language processing (NLP)?”, what would your answer be?

Of course, it’s converting words into numbers, right? After all, computers can’t understand words like we do — they need everything in numeric form to perform any computations. But here’s the catch: converting words to numbers in a meaningful way isn’t as straightforward as it seems.

The Problem with Traditional Word Representations

Initially, the simplest way we represented words as numbers was through techniques like one-hot encoding. Each word was assigned a unique vector of 1s and 0s. While this approach works, it completely ignores relationships between words. For example, “dog” and “cat” would be as unrelated as “dog” and “table” in this representation, despite “dog” and “cat” being semantically closer.
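To see why one-hot vectors carry no notion of similarity, here is a minimal sketch with a made-up four-word vocabulary: the dot product between any two distinct one-hot vectors is always zero, so "dog" is exactly as unrelated to "cat" as it is to "table."

```python
import numpy as np

# Hypothetical 4-word vocabulary, for illustration only.
vocab = ["dog", "cat", "table", "runs"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every pair of distinct words has dot product 0 -- no similarity signal.
print(one_hot["dog"] @ one_hot["cat"])    # 0.0
print(one_hot["dog"] @ one_hot["table"])  # 0.0
print(one_hot["dog"] @ one_hot["dog"])    # 1.0 (only self-similarity)
```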

To address this, we developed word embeddings like Word2Vec, which represent each word as a vector in a continuous space, capturing semantic relationships like “king” being closer to “queen” than to “apple.” But word embeddings have a problem: they capture a static, averaged meaning rather than a dynamic, contextual one. Whether we say “river bank” or “money bank,” “bank” gets the same vector.

The Need for Context

But here’s the thing: words change meaning depending on the context. Think about the word “bank”:

  • “I went to the bank to deposit money.”
  • “We sat by the river bank.”

In both sentences, the word “bank” is used, but its meaning is completely different. Traditional word embeddings can’t capture this change in meaning. They assign one fixed vector to “bank,” even though the word means different things in different contexts.

How Self-Attention Works from First Principles

To truly capture the context of words in a sentence, we needed a way to represent each word based on its relationship with every other word in the sentence. In simple terms, we needed to weigh how important other words are when interpreting a given word.

Instead of assigning a fixed meaning to each word (as we did with traditional embeddings), we began representing each word as a weighted sum of all the other words in the sentence. The key idea here is that we dynamically adjust these weights based on how relevant each word is to the word we’re focusing on.

This idea of dynamically focusing on different parts of the sequence is why the mechanism is referred to as “attention.”

In self-attention, we start with the old word embeddings, which are the initial vector representations of the words in the sentence. But we don’t stop there. The goal of self-attention is to generate new, context-aware embeddings for each word by making each word a weighted combination of all the other words in the sequence.

So, how do we calculate these weights?

These weights represent how much each word should “pay attention” to every other word in the sentence, and they are based on similarity scores between the words. To calculate these similarity scores, we use the dot product.

Imagine we are trying to calculate a new embedding for a specific word (let’s call it Word A). To do this, we compare Word A with all the other words in the sentence, including itself.

  • For each comparison, we take the dot product of Word A’s embedding (its query) with the embeddings of the other words (their keys). This gives us a similarity score — essentially a measure of how related Word A is to each of the other words.

These similarity scores tell us how much weight each word should have in influencing the final representation of Word A. If two words are very related (high dot product), the weight will be high, meaning that word will strongly influence Word A’s final embedding. If they’re less related (low dot product), the influence will be smaller.
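A quick sketch of this intuition, with toy 3-dimensional embeddings (the numbers are made up for illustration): vectors that point in similar directions yield a high dot product, so the corresponding word would get a large attention weight.

```python
import numpy as np

# Toy embeddings, invented for illustration.
e_a = np.array([1.0, 0.5, 0.0])           # Word A
e_related = np.array([0.9, 0.4, 0.1])     # points in a similar direction
e_unrelated = np.array([-0.5, 0.1, 1.0])  # points elsewhere

print(e_a @ e_related)    # high score -> strong influence on Word A
print(e_a @ e_unrelated)  # low score -> weak influence on Word A
```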

Now let’s take an example.

Suppose we have the sentence “bank grows money”, and we want to generate a new embedding for the word “bank” using self-attention. Here’s how the process unfolds:

  1. Old word embeddings: We start with the original word embeddings for each word in the sentence:
  • bank: E_bank
  • grows: E_grows
  • money: E_money

These embeddings are just vectors representing each word, but they don’t yet capture context.

2. Calculating similarity (dot product): To generate a new embedding for “bank,” we calculate how similar “bank” is to itself, “grows,” and “money.”

  • Similarity between “bank” and “bank”: We take the dot product of E_bank with itself. This will be a high score, since it’s comparing "bank" to itself.
  • Similarity between “bank” and “grows”: We take the dot product of E_bank with E_grows. This score tells us how relevant "grows" is for understanding "bank."
  • Similarity between “bank” and “money”: We take the dot product of E_bank with E_money. This tells us how much attention "bank" should pay to "money."

3. Generating weights (softmax): The dot products give us raw similarity scores for each pair. But to make these scores more interpretable, we pass them through the softmax function, which converts them into weights that sum to 1.

Let’s assume the following weights result:

  • Weight for “bank” itself: 0.6
  • Weight for “grows”: 0.2
  • Weight for “money”: 0.2

These weights tell us how much attention “bank” should give to each word.

4. Weighted sum of the values: Now that we have weights, we combine the original embeddings of each word to form the new embedding for "bank."

  • New embedding for "bank" = (0.6 * E_bank) + (0.2 * E_grows) + (0.2 * E_money)

This new embedding is context-aware: it doesn’t just represent the word "bank" on its own; it now also incorporates the influence of "grows" and "money," based on how relevant those words are.
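The whole simplified pipeline above (dot products, softmax, weighted sum) fits in a few lines of NumPy. The 4-dimensional embeddings below are made up purely for illustration, so the resulting weights won’t match the 0.6/0.2/0.2 example exactly, but the mechanics are the same.

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Made-up 4-d embeddings for "bank", "grows", "money" (illustrative only).
E = np.array([
    [1.0, 0.2, 0.0, 0.5],   # E_bank
    [0.3, 0.8, 0.1, 0.0],   # E_grows
    [0.9, 0.1, 0.7, 0.2],   # E_money
])

scores = E @ E[0]           # dot product of every word with "bank"
weights = softmax(scores)   # attention weights, guaranteed to sum to 1
new_bank = weights @ E      # weighted sum of the original embeddings

print(weights)              # largest weight lands on "bank" itself
print(new_bank)             # the new, context-aware embedding for "bank"
```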

Connecting the Dots: Queries, Keys, and Values

Now you can see that the same word vector, like E_bank, or any other word embedding, is used multiple times during the calculation of the new embedding. But here's the interesting part: these aren't the exact same vectors being reused in different steps.

To make self-attention work, we apply learned linear transformations to each word’s embedding, generating three distinct “flavors” of the same word embedding:

  • Query (Q): The vector that represents what each word is looking for in other words.
  • Key (K): The vector that represents the information each word has that others can use.
  • Value (V): The actual content or meaning that each word contributes to the final representation.

For example, for the word “bank”, we generate:

  • Q_bank: The query that asks, "What am I looking for in other words to understand 'bank'?"
  • K_bank: The key that represents what the word "bank" offers to other words.
  • V_bank: The value, which is the meaning "bank" contributes after the attention process.

This transformation is the first step in making self-attention work.
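As a sketch, these three “flavors” are just three matrix multiplications applied to the same embedding. The matrices below are random stand-ins for what a real model learns during training, and the dimension is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4                          # embedding size (assumed for the example)

# Learned projection matrices -- random here as stand-ins for trained weights.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

E_bank = rng.normal(size=d_model)    # placeholder embedding for "bank"

Q_bank = E_bank @ W_q   # query: what "bank" is looking for in other words
K_bank = E_bank @ W_k   # key: what "bank" offers to other words
V_bank = E_bank @ W_v   # value: what "bank" contributes after attention
```

Because the three matrices are different, the same embedding yields three distinct vectors, each playing a different role in the attention computation.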

The Full Mechanism: Step by Step

Let’s now walk through how this transformation unfolds for each word:

  1. Generate Queries, Keys, and Values: For each word in the sentence (“bank,” “grows,” “money”), we generate three vectors: a query (Q), a key (K), and a value (V). These vectors are derived from the original word embeddings by applying learned transformations (matrices).

2. Calculate Attention Scores:

  • To figure out how much attention each word should give to other words, we calculate a similarity score by taking the dot product of the query of one word and the keys of all the other words.

For instance, to compute how much attention “bank” should pay to “grows” and “money,” we compute the dot product of Q_bank with K_grows and K_money. The result gives us a score for how relevant those words are to "bank."

3. Scale and Normalize the Scores:

  • To make these similarity scores easier to work with, we scale them by dividing them by the square root of the dimension of the key vectors. This step prevents the dot product scores from getting too large and helps stabilize training.
  • Then, we pass these scaled scores through the softmax function to normalize them. This gives us the attention weights, which tell us how much influence each word should have in calculating the final representation of “bank.”

4. Weighted Sum of Values:

Now that we have the attention weights, we calculate the new embedding for “bank” by taking the weighted sum of the values for all the words in the sentence.

Each word’s contribution is scaled by its attention weight. So, if “grows” has a higher weight compared to “money,” the new embedding for “bank” will give more importance to the information coming from “grows.”

Mathematically, the new embedding for “bank” looks like this:

New embedding for “bank” = ∑ (attention weight) × (value vector)

This process happens in parallel for every word in the sentence.
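The four steps above can be sketched end to end as one function. This is a minimal NumPy version of scaled dot-product self-attention over a whole sentence at once; the embeddings and projection matrices are random placeholders, and a real model would use trained weights and batched tensors.

```python
import numpy as np

def self_attention(E, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sentence.

    E has shape [num_words, d_model]; returns one context-aware
    embedding per word, computed for all words in parallel."""
    Q, K, V = E @ W_q, E @ W_k, E @ W_v       # step 1: queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # steps 2-3: similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                        # step 4: weighted sum of values

rng = np.random.default_rng(1)
d = 4
E = rng.normal(size=(3, d))                   # "bank", "grows", "money"
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(E, W_q, W_k, W_v)
print(out.shape)                              # (3, 4): one new vector per word
```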

Why Self-Attention is Powerful

What makes self-attention powerful is its ability to dynamically adjust which parts of a sentence are important for understanding each word. Unlike earlier approaches like recurrent neural networks (RNNs), which process sentences sequentially and may struggle with long-range dependencies, self-attention allows each word to directly access and incorporate information from all other words in the sentence in parallel.

This parallelism, combined with the ability to focus on relevant words dynamically, is what makes self-attention so effective for tasks like machine translation, text summarization, and even models like Transformers, which revolutionized NLP.
