Understanding Transformers (Part 5)

Putting It All Together

Sebastien Callebaut
7 min read · May 21, 2024

Transformers and Attention Mechanism

Imagine you are a professor researching a complex topic. You have a team of research assistants (RAs) to help you gather and analyze information from a large set of documents. The RAs have a magical ability called “attention,” which enables them to dynamically focus on the most relevant parts of the documents.


Attention Mechanism

The attention mechanism allows the RAs to instantly scan the entire library and pinpoint exactly which books are most relevant to any given query. They don’t need to go through books one by one or stick to predefined categories. Instead, they can dynamically determine which books (or parts of books) to focus on, regardless of their location in the library.

Components of Transformers

Transformers consist of two main parts: the encoder and the decoder. Both parts utilize the attention mechanism extensively.

1/ Encoder

The encoder reads and processes the input sequence, producing a set of encoded representations.

Process:

Input Embeddings:

  • Each word in the input sequence is converted into an embedding, a dense vector that captures the word’s meaning.
  • Analogy: Each document is represented by a detailed summary that captures its essential information.

Positional Encoding:

  • Since transformers do not process the input sequentially, they add positional encodings to the embeddings to retain the order of words.
  • Analogy: The summaries include the position of each piece of information to understand the order in which the information appears.
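
As a concrete illustration, here is a minimal PyTorch sketch of the sinusoidal positional encodings from the original "Attention Is All You Need" formulation. The sequence length and model dimension are arbitrary, chosen only for illustration:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return (seq_len, d_model) sinusoidal position signals, one row per position."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)      # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                   # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(positions * div_term)   # odd dimensions use cosine
    return pe

# Each position gets a unique pattern, which is simply added to the word embeddings.
pe = sinusoidal_positional_encoding(seq_len=6, d_model=512)    # one row per word
# embeddings_with_position = word_embeddings + pe              # broadcast over the batch
```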

Self-Attention:

  • Each word attends to all other words in the sequence to gather contextual information.
  • Analogy: Each RA independently looks at all the documents, comparing each document to all other documents to find the most relevant information for every part of the query.

Feed-Forward Network:

  • The output from the self-attention layer is passed through a feed-forward neural network to further process the information.
  • Analogy: After comparing the documents, the RA refines their findings using additional analysis techniques.

Stacked Layers:

  • The encoder consists of multiple layers of self-attention and feed-forward networks.
  • Analogy: The RA goes through several rounds of reviewing and refining the information to ensure the highest quality analysis.
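
To make the encoder side concrete, here is a compressed PyTorch sketch of one encoder block and a small stack of them. It is a simplification, not a reference implementation: the dimensions, the head count, and the use of nn.MultiheadAttention are illustrative choices.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: self-attention, then a position-wise feed-forward network,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)   # every token attends to every other token
        x = self.norm1(x + attn_out)            # residual connection + normalization
        x = self.norm2(x + self.ffn(x))         # feed-forward refinement, same pattern
        return x

# "Stacked layers": the encoder is just several of these blocks applied one after another.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
x = torch.randn(1, 6, 512)        # (batch, seq_len, d_model): embeddings + positional encodings
encoded = encoder(x)              # same shape, now context-aware
```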

2/ Decoder

The decoder generates the output sequence by attending to the encoder’s output and the previously generated tokens.

Process:

Input Embeddings:

  • Each word in the partially generated output sequence is converted into an embedding.
  • Analogy: Each part of the report you are writing is represented by a detailed summary.

Positional Encoding:

  • Similar to the encoder, positional encodings are added to retain the order of the output sequence.
  • Analogy: The summaries include the position of each part of the report to understand the order.

Self-Attention:

  • Each word in the partially generated sequence attends to all previous words to gather contextual information.
  • Analogy: Each part of the report considers all previously written parts to maintain coherence.

Cross-Attention:

  • The decoder attends to the encoder’s output to incorporate information from the input sequence.
  • Analogy: The RA generating the report uses the summaries from the other RAs to ensure the report is comprehensive and relevant.

Feed-Forward Network:

  • The output from the cross-attention layer is passed through a feed-forward neural network.
  • Analogy: The RA refines the report using additional analysis techniques.

Stacked Layers:

  • The decoder also consists of multiple layers of self-attention, cross-attention, and feed-forward networks.
  • Analogy: The RA goes through several rounds of reviewing and refining the report.
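
And a matching sketch for the decoder side, again simplified with illustrative dimensions: masked self-attention over the tokens generated so far, cross-attention over the encoder output, then a feed-forward network.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder block: masked self-attention, cross-attention over the encoder
    output, then a position-wise feed-forward network."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        seq_len = y.size(1)
        # Causal mask: each position may only attend to itself and earlier positions.
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        self_out, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norm1(y + self_out)
        # Cross-attention: queries from the decoder, keys/values from the encoder output.
        cross_out, _ = self.cross_attn(y, encoder_out, encoder_out)
        y = self.norm2(y + cross_out)
        return self.norm3(y + self.ffn(y))

decoder_layer = DecoderLayer()
encoder_out = torch.randn(1, 6, 512)   # encoded input sentence (6 tokens)
y = torch.randn(1, 3, 512)             # embeddings of the partially generated output
out = decoder_layer(y, encoder_out)    # (1, 3, 512)
```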

Detailed Examples

Let’s break down each component using a detailed example.

1/ Attention Mechanism

Analogy: You (the professor) ask each RA to find the most relevant information from the documents based on your specific question.

Queries (Q):

  • Intuition: The specific question or topic you want to find information about.
  • Analogy: You ask an RA, “Find information about neural networks.”
  • In Transformers: Each input token generates its own query vector through a learned linear transformation. This query represents what the token is seeking.

Keys (K):

  • Intuition: Descriptors that help identify what information is contained in each document.
  • Analogy: Each document has a summary or a set of keywords like “neural networks, deep learning, AI.”
  • In Transformers: Each input token also generates its own key vector through another linear transformation. This key represents the content of the token.

Values (V):

  • Intuition: The actual information or content that the RA will use if the document is found relevant.
  • Analogy: The value is the full content of the book or document.
  • In Transformers: Each input token generates a value vector through a third linear transformation. This value is the actual content to be attended to.

Process:

  • Each RA matches your query against the document summaries (keys).
  • They compute a score for each document based on the match.
  • They use these scores to weigh the importance of the content (values) from each document.
  • Finally, they compile a summary based on the most relevant content.
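
In code, those four steps are the scaled dot-product attention computation, softmax(QKᵀ / √d_k)·V. A minimal sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: score each query against every key, then blend the values."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # how well each query matches each key
    weights = F.softmax(scores, dim=-1)           # scores -> attention weights that sum to 1
    return weights @ V, weights                   # weighted sum of the values

# Illustrative shapes only: 6 tokens, 64-dimensional vectors.
Q, K, V = torch.randn(6, 64), torch.randn(6, 64), torch.randn(6, 64)
context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, weights.shape)   # torch.Size([6, 64]) torch.Size([6, 6])
```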

2/ Self-Attention

Analogy: Each RA independently looks at all the documents, comparing each document to all other documents to find the most relevant information for every part of your question.

Example:

  • Suppose your query is a document with sections asking about “neural networks,” “training methods,” and “applications.”
  • Each section of the query is compared with all sections of all documents.
  • The RA dynamically determines which sections of the documents are most relevant to each part of the query.

In-Depth Process:

Generate Q, K, V:

  • Each word in a sentence generates its own Q, K, and V vectors.
  • For example, for the sentence “The cat sat on the mat,” the word “cat” generates its own Q, K, and V vectors.

Compute Similarity Scores:

  • The query vector for “cat” is compared with the key vectors of all words, including itself (a dot product, typically scaled by the square root of the key dimension).
  • The resulting scores indicate how relevant each word is to “cat.”

Calculate Attention Weights:

  • Apply a softmax function to these scores to get weights.
  • These weights determine how much attention “cat” should give to each word.

Weighted Sum of Values:

  • The final representation of “cat” is a weighted sum of the value vectors, where the weights come from the attention scores.
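
Putting the four steps together for this sentence, here is a small PyTorch sketch. The embeddings are random stand-ins and the projections are untrained, so the printed weights are only illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 8                                    # tiny dimension, purely for illustration
x = torch.randn(len(tokens), d_model)          # stand-in embeddings (+ positional encodings)

# Learned linear projections turn the same sequence into queries, keys, and values.
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

scores = Q @ K.T / d_model**0.5                # similarity of every word with every word
weights = F.softmax(scores, dim=-1)            # one row of attention weights per word
context = weights @ V                          # each row is a context-aware representation

cat = tokens.index("cat")
print({t: round(w, 2) for t, w in zip(tokens, weights[cat].tolist())})
# How much "cat" attends to each word (exact numbers depend on the random initialization).
```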

3/ Multi-Head Attention

Analogy: You have multiple RAs, each with a different area of expertise. They independently perform the attention process and then combine their findings to provide a comprehensive answer.

Example:

  • RA1 is an expert in neural network architectures.
  • RA2 is an expert in training methods.
  • RA3 is an expert in applications of neural networks.
  • Each RA performs the attention process on the documents independently.
  • They then combine their findings to give you a detailed and nuanced summary.

In-Depth Process:

Multiple Heads:

  • Suppose we use 3 heads. Each head has its own set of linear transformations to generate Q, K, and V.

Independent Attention:

  • Head 1 might focus on subject-verb relationships.
  • Head 2 might focus on positional relationships.
  • Head 3 might focus on noun-adjective relationships.

Combine Outputs:

  • Each head produces its own context-aware representations.
  • These representations are concatenated and passed through a final linear layer to produce the final multi-head attention output.
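
A compact sketch of multi-head attention in PyTorch, using three heads to mirror the three-RA example above. The tiny model dimension is purely illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Several attention heads run in parallel; their outputs are concatenated and re-projected."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # Each head effectively gets its own slice of these learned projections.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # final linear layer applied after concatenation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        def split_heads(t):  # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        Q, K, V = split_heads(self.W_q(x)), split_heads(self.W_k(x)), split_heads(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / self.d_head**0.5
        heads = F.softmax(scores, dim=-1) @ V                       # each head attends independently
        concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.W_o(concat)                                     # combine the heads' findings

# Three heads over a tiny 12-dimensional model, mirroring the three-RA example.
mha = MultiHeadAttention(d_model=12, n_heads=3)
out = mha(torch.randn(1, 6, 12))   # (batch=1, seq_len=6, d_model=12)
```

Splitting one large projection into per-head slices is how most implementations keep the cost comparable to a single big attention head while still letting each head specialize.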

4/ Cross-Attention

Analogy: You have two groups of RAs: one group summarizes a set of input documents, and the other group generates a report based on your specific questions, focusing on the summaries provided by the first group.

Example:

  • Group 1 RAs summarize the input documents (encoder).
  • You provide specific questions to Group 2 RAs (decoder).
  • Group 2 RAs use the summaries from Group 1 to generate a report, focusing on the most relevant parts of the summaries.

In-Depth Process:

Encoder Output:

  • The encoder processes the entire input sequence and produces a set of encoded representations (one for each token).

Decoder Queries:

  • The decoder generates queries from its previous outputs (e.g., starting with the start token).

Cross-Attention Calculation:

  • The decoder queries are compared against the encoder keys (encoded representations).
  • The similarities determine the weights assigned to the encoder values.
  • The weighted sum of encoder values produces context-aware representations for the decoder to generate the next token.
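
A minimal sketch of that calculation: the only difference from self-attention is where Q, K, and V come from. The random tensors standing in for the encoder output and decoder states, and the dimensions, are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 8                                       # tiny, illustrative dimension
W_q = nn.Linear(d_model, d_model, bias=False)     # applied to the decoder's states
W_k = nn.Linear(d_model, d_model, bias=False)     # applied to the encoder's output
W_v = nn.Linear(d_model, d_model, bias=False)     # applied to the encoder's output

encoder_out = torch.randn(6, d_model)       # encoded input sentence (6 tokens)
decoder_states = torch.randn(2, d_model)    # states for the output generated so far

Q = W_q(decoder_states)                     # queries come from the decoder
K, V = W_k(encoder_out), W_v(encoder_out)   # keys and values come from the encoder

weights = F.softmax(Q @ K.T / d_model**0.5, dim=-1)   # (2, 6): each output token looks over the input
context = weights @ V                                  # information pulled from the input sentence
print(weights.shape, context.shape)   # torch.Size([2, 6]) torch.Size([2, 8])
```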

Detailed Example for Each Concept

Attention Mechanism:

  • Sentence: “The cat sat on the mat.”
  • Query: “sat” (the word we want a context-aware representation of).
  • Keys: Contextual words (“The,” “cat,” “sat,” “on,” “the,” “mat”).
  • Values: Actual content (embeddings of these words).

Process:

  • Compute similarity between “sat” and each word.
  • Apply softmax to get attention weights.
  • Multiply these weights with the embeddings to get a context-aware representation of “sat.”

Self-Attention:

  • Sentence: “The cat sat on the mat.”
  • Process: Each word attends to all words, including itself.
  • For “cat”:
  • Query from “cat.”
  • Keys from all words.
  • Values from all words.
  • Compute similarities, apply softmax, take the weighted sum, and update the representation of “cat.”

Multi-Head Attention:

  • Sentence: “The cat sat on the mat.”
  • Process: Use multiple attention heads to capture different aspects.
  • Head 1: Focuses on subject-verb.
  • Head 2: Focuses on positional.
  • Head 3: Focuses on noun-adjective.
  • Combine outputs of all heads for a richer representation.

Cross-Attention:

  • Input Sentence (English): “The cat sat on the mat.”
  • Output Sentence (French): “Le chat s’est assis sur le tapis.”

Process:

  • Encoder processes the English sentence, produces encoded representations.
  • Decoder starts generating the French sentence, uses cross-attention to focus on the encoder’s output while producing each word.
  • For “Le” (initial word in decoder):
  • Query from decoder’s previous state.
  • Keys from encoder’s output.
  • Values from encoder’s output.
  • Compute similarities, get weighted sum, generate next word “chat.”
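
To show how cross-attention is used over and over during generation, here is a schematic greedy-decoding loop. The encoder, decoder, and token ids are hypothetical placeholders standing in for a trained model, not a specific library API:

```python
def translate(encoder, decoder, src_ids, start_id, end_id, max_len=20):
    """Greedy decoding sketch: the decoder cross-attends to the encoder output at every step."""
    encoder_out = encoder(src_ids)                    # encode "The cat sat on the mat" once
    output_ids = [start_id]
    for _ in range(max_len):
        logits = decoder(output_ids, encoder_out)     # cross-attention over encoder_out
        next_id = logits[-1].argmax().item()          # greedy choice of the most likely next token
        output_ids.append(next_id)                    # "Le", then "chat", then "s'est", ...
        if next_id == end_id:
            break
    return output_ids
```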

Summary

  • Attention Mechanism: An RA focuses on finding relevant parts of the documents for a given query.
  • Self-Attention: Each RA considers all parts of the input simultaneously to understand context.
  • Multi-Head Attention: Multiple RAs, each with a different focus, analyze the input independently and combine their results for a richer understanding.
  • Cross-Attention: Decoder RAs use summaries provided by encoder RAs to generate a coherent and contextually relevant output.

These mechanisms together enable transformers to effectively process and generate sequences with rich, context-aware representations.

Sebastien Callebaut

Using data and coding to make better investing decisions. Co-founder of stockviz.com