Understanding Transformers (Part 4)

A Clear View On All Attention Mechanisms

Sebastien Callebaut
5 min read · May 21, 2024

Attention

Analogy: You (the professor) ask each research assistant (RA) to find the most relevant information from the documents based on your specific question.


Queries (Q):

  • Intuition: The specific question or topic you want to find information about.
  • Analogy: You ask an RA, “Find information about neural networks.”
  • In Transformers: Each input token generates its own query vector through a learned linear transformation. This query represents what the token is seeking.

Keys (K):

  • Intuition: Descriptors that help identify what information is contained in each document.
  • Analogy: Each document has a summary or a set of keywords like “neural networks, deep learning, AI.”
  • In Transformers: Each input token also generates its own key vector through another linear transformation. This key represents the content of the token.

Values (V):

  • Intuition: The actual information or content that the RA will use if the document is found relevant.
  • Analogy: The value is the full content of the document itself.
  • In Transformers: Each input token generates a value vector through a third linear transformation. This value is the actual content to be attended to.

Process:

  • Each RA matches your query against the document summaries (keys).
  • They compute a score for each document based on the match.
  • They use these scores to weigh the importance of the content (values) from each document.
  • Finally, they compile a summary based on the most relevant content (a minimal code sketch of this step follows below).
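
To make this concrete, here is a minimal NumPy/SciPy sketch of that scoring-and-weighting step. The dimensions, random vectors, and the `attention` function name are illustrative stand-ins, not part of any particular library:

```python
import numpy as np
from scipy.special import softmax

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well each query matches each key
    weights = softmax(scores, axis=-1)   # weights sum to 1 across the keys
    return weights @ V, weights          # weighted sum of the values

# Toy example: one query ("your question") against four documents.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))   # one query vector, dimension 8 (illustrative)
K = rng.normal(size=(4, 8))   # one key per document
V = rng.normal(size=(4, 8))   # one value per document
summary, weights = attention(Q, K, V)
print(weights.round(3))       # how much each document contributes to the summary
```

The printed weights play the role of the RA's relevance scores: they decide how much of each document's value vector ends up in the final summary.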

Self-Attention

Analogy: Each RA independently looks at all the documents, comparing each document to all other documents to find the most relevant information for every part of your question.

Example:

  • Suppose your query is a document with sections asking about “neural networks,” “training methods,” and “applications.”
  • Each section of the query is compared with all sections of all documents.
  • The RA dynamically determines which sections of the documents are most relevant to each part of the query.

In-Depth Process:

Generate Q, K, V:

  • Each word in a sentence generates its own Q, K, and V vectors.
  • For example, for the sentence “The cat sat on the mat,” the word “cat” generates its own Q, K, and V vectors.

Compute Similarity Scores:

  • The query vector for “cat” is compared with the key vectors of all words, including itself.
  • Scores indicate how relevant each word is to “cat.”

Calculate Attention Weights:

  • Apply a softmax function to these scores to get weights.
  • These weights determine how much attention “cat” should give to each word.

Weighted Sum of Values:

  • The final representation of “cat” is a weighted sum of the value vectors, where the weights come from the attention scores (see the sketch after this list).
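
Putting those four steps together, here is a compact self-attention sketch. The embeddings and projection matrices are random stand-ins for what a trained model would learn:

```python
import numpy as np
from scipy.special import softmax

def self_attention(X, W_q, W_k, W_v):
    """Every token attends to every token in the same sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # 1. generate Q, K, V per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # 2. similarity scores
    weights = softmax(scores, axis=-1)           # 3. attention weights
    return weights @ V, weights                  # 4. weighted sum of values

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 16
rng = np.random.default_rng(1)
X = rng.normal(size=(len(tokens), d_model))      # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)

# The row for "cat": how much "cat" attends to every word, including itself.
print(dict(zip(tokens, weights[tokens.index("cat")].round(3))))
```

With random weights the printed numbers are arbitrary; in a trained model, “cat” would typically place more weight on closely related words such as “sat.”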

Multi-Head Attention

Analogy: You have multiple RAs, each with a different area of expertise. They independently perform the attention process and then combine their findings to provide a comprehensive answer.

Example:

  • RA1 is an expert in neural network architectures.
  • RA2 is an expert in training methods.
  • RA3 is an expert in applications of neural networks.
  • Each RA performs the attention process on the documents independently.
  • They then combine their findings to give you a detailed and nuanced summary.

In-Depth Process:

Multiple Heads:

  • Suppose we use 3 heads. Each head has its own set of linear transformations to generate Q, K, and V.

Independent Attention:

  • Head 1 might focus on subject-verb relationships.
  • Head 2 might focus on positional relationships.
  • Head 3 might focus on noun-adjective relationships.

Combine Outputs:

  • Each head produces its own context-aware representations.
  • These representations are concatenated and passed through a final linear layer to produce the multi-head attention output (see the sketch below).
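
A rough sketch of this split-attend-concatenate pattern with 3 heads. As is common, each head here works in a smaller subspace (d_model split across the heads); all weights are random placeholders:

```python
import numpy as np
from scipy.special import softmax

def multi_head_attention(X, heads, W_o):
    """Each head attends independently; outputs are concatenated and mixed."""
    outputs = []
    for W_q, W_k, W_v in heads:                        # one set of projections per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(weights @ V)                    # this head's context-aware output
    concat = np.concatenate(outputs, axis=-1)          # concatenate the heads
    return concat @ W_o                                # final linear layer mixes them

n_tokens, d_model, n_heads = 6, 12, 3
d_head = d_model // n_heads                            # each head works in a smaller subspace
rng = np.random.default_rng(2)
X = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, heads, W_o).shape)       # (6, 12)
```

Splitting the model dimension across heads keeps the total amount of computation roughly the same as a single full-width head, while still letting each head specialize.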

Cross-Attention

Analogy: You have two groups of RAs: one group summarizes a set of input documents, and the other group generates a report based on your specific questions, focusing on the summaries provided by the first group.

Example:

  • Group 1 RAs summarize the input documents (encoder).
  • You provide specific questions to Group 2 RAs (decoder).
  • Group 2 RAs use the summaries from Group 1 to generate a report, focusing on the most relevant parts of the summaries.

In-Depth Process:

Encoder Output:

  • The encoder processes the entire input sequence and produces a set of encoded representations (one for each token).

Decoder Queries:

  • The decoder generates queries from its previous outputs (e.g., starting with the start token).

Cross-Attention Calculation:

  • The decoder queries are compared against the encoder keys (encoded representations).
  • The similarities determine the weights assigned to the encoder values.
  • The weighted sum of encoder values produces context-aware representations that the decoder uses to generate the next token (see the sketch below).
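
The mechanics are the same as self-attention, except the queries come from one sequence and the keys and values from another. A minimal sketch, with random matrices standing in for the encoder output and decoder states:

```python
import numpy as np
from scipy.special import softmax

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    """Decoder queries attend over encoder keys and values."""
    Q = decoder_states @ W_q         # queries come from the decoder
    K = encoder_output @ W_k         # keys come from the encoder output
    V = encoder_output @ W_v         # values come from the encoder output
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V, weights      # context for the decoder's next prediction

d_model = 16
rng = np.random.default_rng(3)
encoder_output = rng.normal(size=(6, d_model))   # 6 source tokens, already encoded
decoder_states = rng.normal(size=(2, d_model))   # decoder has produced 2 tokens so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(weights.shape)   # (2, 6): each decoder position weighs all 6 source tokens
```

Note that only the queries change from one decoding step to the next; the encoder keys and values can be computed once and reused for the whole output sequence.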

Detailed Example for Each Concept

Attention Mechanism:

  • Sentence: “The cat sat on the mat.”
  • Query: “sat” (looking to translate this word).
  • Keys: Contextual words (“The,” “cat,” “sat,” “on,” “the,” “mat”).
  • Values: Actual content (embeddings of these words).

Process:

  • Compute similarity between “sat” and each word.
  • Apply softmax to get attention weights.
  • Multiply these weights by the value vectors (the embeddings) to get a context-aware representation of “sat” (worked through numerically below).
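
To make the arithmetic concrete, here is the softmax step with hand-picked toy scores; the numbers are invented purely for illustration:

```python
import numpy as np

words = ["The", "cat", "sat", "on", "the", "mat"]
# Invented similarity scores between the query for "sat" and each word's key.
scores = np.array([0.1, 2.0, 1.5, 0.3, 0.1, 1.8])

# Softmax turns raw scores into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()
print(dict(zip(words, weights.round(2))))
# "cat" and "mat" get the largest weights, so their value vectors dominate
# the weighted sum that becomes the context-aware representation of "sat".
```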

Self-Attention:

  • Sentence: “The cat sat on the mat.”

Process:

  • Each word attends to all words, including itself.
  • For “cat”:
  • Query from “cat.”
  • Keys from all words.
  • Values from all words.
  • Compute similarities, apply softmax, get weighted sum, update “cat’s” representation.

Multi-Head Attention:

  • Sentence: “The cat sat on the mat.”

Process:

  • Use multiple attention heads to capture different aspects.
  • Head 1: Focuses on subject-verb.
  • Head 2: Focuses on positional.
  • Head 3: Focuses on noun-adjective.
  • Combine outputs of all heads for a richer representation.

Cross-Attention:

  • Input Sentence (English): “The cat sat on the mat.”
  • Output Sentence (French): “Le chat s’est assis sur le tapis.”

Process:

  • Encoder processes the English sentence, produces encoded representations.
  • Decoder starts generating the French sentence, uses cross-attention to focus on the encoder’s output while producing each word.
  • To generate the word after “Le” (the first decoded word):
  • The query comes from the decoder’s current state.
  • Keys come from the encoder’s output.
  • Values come from the encoder’s output.
  • Compute similarities, apply softmax, take the weighted sum, and use the resulting context to predict the next word, “chat” (a simplified decoding-loop sketch follows below).
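
Below is a deliberately simplified greedy-decoding sketch: the decoder “state” is reduced to the embedding of the last generated word, and every weight matrix is random, so the generated words are meaningless. The point is only to show where cross-attention sits inside the generation loop:

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(4)
d_model = 16
vocab = ["<start>", "Le", "chat", "s'est", "assis", "sur", "le", "tapis", "<end>"]

# Random stand-ins for what a trained translation model would provide.
encoder_output = rng.normal(size=(6, d_model))      # the encoded English tokens
embed = rng.normal(size=(len(vocab), d_model))      # toy French token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_out = rng.normal(size=(d_model, len(vocab)))      # projects context onto vocabulary scores

generated = ["<start>"]
for _ in range(4):                                   # a few greedy decoding steps
    # Query from the decoder's latest state (here: the last generated token's embedding).
    q = embed[vocab.index(generated[-1])] @ W_q
    K = encoder_output @ W_k                         # keys from the encoder's output
    V = encoder_output @ W_v                         # values from the encoder's output
    weights = softmax(q @ K.T / np.sqrt(d_model))
    context = weights @ V                            # weighted sum of encoder values
    generated.append(vocab[int(np.argmax(context @ W_out))])  # greedy pick of the next word
print(generated)  # with random weights the words are arbitrary; the loop structure is the point
```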

Summary:

  • Attention Mechanism: An RA focuses on finding relevant parts of the documents for a given query.
  • Self-Attention: Each RA considers all parts of the input simultaneously to understand context.
  • Multi-Head Attention: Multiple RAs, each with a different focus, analyze the input independently and combine their results for a richer understanding.
  • Cross-Attention: Decoder RAs use summaries provided by encoder RAs to generate a coherent and contextually relevant output.

These mechanisms together enable transformers to effectively process and generate sequences with rich, context-aware representations.

Sebastien Callebaut

Using data and coding to make better investing decisions. Co-founder of stockviz.com