Demystifying Transformers: Attention Mechanism

Representing Word Meanings Based on Their Context

Dagang Wei
Feb 13, 2024

This article is part of the series Demystifying Transformers.

Introduction

The Transformer architecture has revolutionized Natural Language Processing, and at the core of its brilliance lies the attention mechanism. But what is attention, and why does it make such a difference? Let’s unravel it all through the lens of a simple word: “apple”.

Attention: It’s All About the Context

Think about the word “apple”. It can instantly conjure images of a juicy red fruit or the iconic tech company. Our brains quickly identify the right meaning based on the surrounding context. Similarly, the attention mechanism in Transformers allows models to zero in on the most relevant parts of a sentence when processing information.

Consider these sentences:

  1. “I picked a ripe apple from the tree.”
  2. “The new Apple smartphone has an amazing camera.”

In the first sentence, “apple” clearly refers to the fruit. In the second, it unambiguously points to the company. Attention helps a Transformer-based model make this distinction by focusing on the contextual cues within each sentence. Visually, you can think of words as vectors in a high-dimensional space, and the word “apple” is pulled towards either the fruit side or the company side depending on the context.

Why Attention Matters

  • Contextual Understanding: Attention tackles the complexities of language where single words carry multiple meanings. It sharpens the model’s understanding by weighing contextual clues derived from other words in the sentence.
  • Beyond Sequence Bottlenecks: Recurrent models such as RNNs and LSTMs struggle with long sentences and paragraphs because they compress everything they have read into a single fixed-size representation. Attention lets Transformers handle long-range dependencies directly: any word can attend to any other part of the input, regardless of the distance between them.
  • Efficiency and Explainability: Unlike models that process text step by step, Transformers with attention process all words in parallel. This speeds up training, and the attention weights themselves sometimes offer a glimpse into which words the model relies on most when making a decision.

Word Embeddings and Context

Before diving into the calculations behind attention, let’s understand its conceptual foundation. The attention mechanism can be elegantly described as “associating words with their context using embeddings.”

  • Words as Embeddings: In the world of NLP, words aren’t treated as mere text. They are transformed into numerical vectors called embeddings. These embeddings are special because words with similar meanings tend to have embeddings that are close together in the vector space.
  • The Power of Similarity: The attention mechanism strategically leverages this similarity between embeddings. When focusing on a particular word (the query), it compares its embedding to the embeddings of all words in the sentence, including itself (the keys). The higher the similarity between the query and a key, the more relevant that word is considered to the current context (a short sketch after this list makes this concrete).
  • Building Context-Aware Representations: Based on these similarity scores, the attention mechanism assigns weights to each word. These weights dictate the degree to which each word contributes to understanding the overall context of the target word. This creates a powerful new representation of the target word — one that’s shaped by its surrounding words.
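
To make the similarity idea concrete, here is a minimal sketch that scores a few synthetic embeddings against each other with a dot product. The vectors below are invented purely for illustration; a trained model would learn them from data.

import numpy as np

# Hypothetical 4-dimensional embeddings, invented purely for illustration
apple_fruit = np.array([0.9, 0.1, 0.8, 0.2])  # "apple" in a food context
tree        = np.array([0.8, 0.2, 0.7, 0.1])  # another nature/food word
smartphone  = np.array([0.1, 0.9, 0.2, 0.8])  # a technology word

# Dot product as a simple similarity score: larger means more related
print(np.dot(apple_fruit, tree))        # ~1.32, relatively high
print(np.dot(apple_fruit, smartphone))  # ~0.50, relatively low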

How Attention Works

The attention mechanism relies on three key concepts:

  1. Queries (Q): A query is like a search request. It asks, “What information is most relevant to the word I’m currently looking at?”
  2. Keys (K): Keys represent the potential information offered by each word in the sentence.
  3. Values (V): Values contain the actual content associated with each word.

Attention calculations involve matching the query with keys and using those matches to selectively pull information from the values. Let’s explore the actual calculation used to determine the attention weights; the technique the Transformer uses is called scaled dot-product attention:

Attention(Q, K, V) = softmax(QK^T / √dk) V

Let’s break this formula down:

  • Q (Queries), K (Keys), V (Values): These represent matrices containing the embeddings for our queries, keys, and values. Recall that words or tokens become their respective embeddings!
  • QK^T: This signifies the dot product between the query matrix (Q) and the transpose of the key matrix (K^T). Dot products intuitively measure similarity between vectors.
  • √dk: This is a scaling factor, where dk is the dimension of the key vectors. It prevents extremely large dot-product values from dominating the results (the short sketch after this list shows the effect).
  • softmax: The softmax function takes the scaled dot products and converts them into probabilities summing to 1. These become the attention weights that signify the importance of each word in the context.
  • V: Finally, the calculated attention weights are multiplied with the value matrix (V), ultimately leading to a weighted combination of the value vectors. This output is the new enriched representation of the target word that has absorbed contextual information.
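
The scaling step is easy to underestimate. A small sketch with made-up scores, and an assumed key dimension of dk = 64 (a common choice), shows how dividing by √dk keeps the softmax from collapsing onto a single word:

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# Made-up raw dot-product scores for one query against four keys
scores = np.array([8.0, 6.0, 2.0, 1.0])
d_k = 64  # assumed key dimension, not taken from the example below

print(softmax(scores))                 # heavily peaked: ~[0.88 0.12 0.00 0.00]
print(softmax(scores / np.sqrt(d_k)))  # smoother:       ~[0.37 0.29 0.18 0.16]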

In essence, the attention mechanism tries to represent the query (Q) as a weighted linear combination of value vectors (V), and the weights are determined by the similarity between the query and keys (K). For example, in the sentence “Apple announced their newest phone”, the word “apple” will be represented as a weighted linear combination of “apple”, “announced”, “their”, “newest” and “phone”, and the weights are determined by their similarity with “apple” in the embedding space.
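
One detail worth noting before the code: in a full Transformer, Q, K and V are not the raw embeddings themselves; each is obtained by multiplying the input embeddings by its own learned weight matrix. The from-scratch example below skips that step and uses the embeddings directly as Q, K and V. Here is a minimal sketch of the projection step, with random matrices standing in for trained weights:

import numpy as np

np.random.seed(0)
seq_len, embedding_dim, d_k = 6, 5, 5

# A toy sentence of 6 tokens, each with a random 5-dimensional embedding
X = np.random.rand(seq_len, embedding_dim)

# In a trained model these projections are learned parameters;
# random placeholders are used here only to show the shapes involved.
W_Q = np.random.rand(embedding_dim, d_k)
W_K = np.random.rand(embedding_dim, d_k)
W_V = np.random.rand(embedding_dim, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)  # (6, 5) (6, 5) (6, 5)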

Python Example from Scratch

Let’s implement a basic version of the attention mechanism in Python. This example simplifies some aspects (it uses synthetic embeddings and skips the learned Q/K/V projections), but it gives a clear idea of how attention works. The code is available in this Colab notebook.

import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    weights = softmax(scores)
    attention = np.dot(weights, V)
    print(f"Q shape: {Q.shape}")
    print(f"K shape: {K.shape}")
    print(f"V shape: {V.shape}")
    print(f"scores shape: {scores.shape}")
    print(f"weights shape: {weights.shape}")
    print(f"attention shape: {attention.shape}")
    print(f"weights: {weights}")
    return attention

# Set a seed for reproducibility
np.random.seed(1)

# Define words and their synthetic embeddings
words = ["I", "love", "to", "eat", "a", "juicy", "apple", "pie", "announced",
         "pie", "their", "newest", "phone", "today"]
embedding_dim = 5
embeddings = {word.lower(): np.random.rand(embedding_dim) for word in words}

# Sentences with the word "apple"
sentence1 = "I love to eat a juicy apple pie"
sentence2 = "Apple announced their newest phone today"

# Calculate attention for each instance of "apple"
for sentence in [sentence1, sentence2]:
    tokens = sentence.lower().split()
    word_embeddings = [embeddings[token] for token in tokens]

    for i, word in enumerate(tokens):
        if word == "apple":
            query = word_embeddings[i]           # Q: embedding of "apple"
            key = np.array(word_embeddings)      # K: embeddings of all words
            value = np.array(word_embeddings)    # V: same embeddings as K
            attention_output = attention(query, key, value)

            print(f"\nAttention output for '{word}' in the sentence: {sentence}")
            print(attention_output)

Output:

Q shape: (5,)
K shape: (8, 5)
V shape: (8, 5)
scores shape: (8,)
weights shape: (8,)
attention shape: (5,)
weights: [0.09711107 0.11590922 0.11642289 0.11517958 0.16277854 0.10870223
0.17062896 0.11326751]

Attention output for 'apple' in the sentence: I love to eat a juicy apple pie
[0.47233072 0.56100519 0.38018208 0.44847677 0.47360692]
Q shape: (5,)
K shape: (6, 5)
V shape: (6, 5)
scores shape: (6,)
weights shape: (6,)
attention shape: (5,)
weights: [0.20860172 0.15291935 0.13977266 0.15379221 0.14832523 0.19658883]

Attention output for 'apple' in the sentence: Apple announced their newest phone today
[0.30317505 0.57733981 0.49906677 0.60679205 0.45915741]

In this example, the query Q is the embedding of the target word “apple”, while the keys K and values V stack the synthetic, randomly generated embeddings of every word in the sentence. The attention function computes the similarity scores, normalizes them into weights with the softmax function, and then returns the weighted sum of the values. This simplified model captures the essence of how attention mechanisms enable models to dynamically focus on different parts of the input data.

Conclusion

The attention mechanism is a cornerstone of transformer models, offering a flexible, efficient way to handle sequential data. By allowing models to dynamically focus on the most relevant parts of the input, attention mechanisms facilitate a deeper understanding of context, which is crucial for complex language tasks. As we’ve seen, the underlying principles are intuitive, and even a basic implementation from scratch can illuminate how these powerful models function. The continued evolution of attention and transformer architectures promises to drive further advances in machine learning and artificial intelligence.
