Understanding Transformers (Part 4)

A Clear View On All Attention Mechanisms

Sebastien Callebaut
5 min read · May 21, 2024

Attention

Analogy: You (the professor) ask each research assistant (RA) to find the most relevant information from the documents based on your specific question.


Queries (Q):

  • Intuition: The specific question or topic you want to find information about.
  • Analogy: You ask an RA, “Find information about neural networks.”
  • In Transformers: Each input token generates its own query vector through a learned linear transformation. This query represents what the token is seeking.

Keys (K):

  • Intuition: Descriptors that help identify what information is contained in each document.
  • Analogy: Each document has a summary or a set of keywords like “neural networks, deep learning, AI.”
  • In Transformers: Each input token also generates its own key vector through another linear transformation. This key represents the content of the token.

Values (V):

  • Intuition: The actual information or content that the RA will use if the document is found relevant.
  • Analogy: The value is the full content of the document itself.
  • In Transformers: Each input token generates a value vector through a third linear transformation. This value is the actual content to be attended to.

Process:

  • Each RA matches your query against the document summaries (keys).
  • They compute a score for each document based on the match.
  • They use these scores to weigh the importance of the content (values) from each document.
  • Finally, they compile a summary based on the most relevant content (a minimal code sketch of this step follows below).
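
To make this concrete, here is a minimal NumPy/SciPy sketch of that scoring-and-weighting step. The dimensions, random vectors, and the `attention` function name are illustrative stand-ins, not part of any particular library:

```python
import numpy as np
from scipy.special import softmax

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well each query matches each key
    weights = softmax(scores, axis=-1)   # weights sum to 1 across the keys
    return weights @ V, weights          # weighted sum of the values

# Toy example: one query ("your question") against four documents.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))   # one query vector, dimension 8 (illustrative)
K = rng.normal(size=(4, 8))   # one key per document
V = rng.normal(size=(4, 8))   # one value per document
summary, weights = attention(Q, K, V)
print(weights.round(3))       # how much each document contributes to the summary
```

The printed weights play the role of the RA's relevance scores: they decide how much of each document's value vector ends up in the final summary.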

Self-Attention

Analogy: Each RA independently looks at all the documents, comparing each document to all other documents to find the most relevant information for every part of your question.

Example:

  • Suppose your query is a document with sections asking about “neural networks,” “training methods,” and “applications.”
  • Each section of the query is compared with all sections of all documents.
  • The RA dynamically determines which sections of the documents are most relevant to each part of the query.

In-Depth Process:

Generate Q, K, V:

  • Each word in a sentence generates its own Q, K, and V vectors.
  • For example, for the sentence “The cat sat on the mat,” the word “cat” generates its own Q, K, and V vectors.

Compute Similarity Scores:

  • The query vector for “cat” is compared with the key vectors of all words, including itself.
  • Scores indicate how relevant each word is to “cat.”

Calculate Attention Weights:

  • Apply a softmax function to these scores to get weights.
  • These weights determine how much attention “cat” should give to each word.

Weighted Sum of Values:

  • The final representation of “cat” is a weighted sum of the value vectors, where the weights come from the attention scores (see the sketch after this list).
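
Putting those four steps together, here is a compact self-attention sketch. The embeddings and projection matrices are random stand-ins for what a trained model would learn:

```python
import numpy as np
from scipy.special import softmax

def self_attention(X, W_q, W_k, W_v):
    """Every token attends to every token in the same sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # 1. generate Q, K, V per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # 2. similarity scores
    weights = softmax(scores, axis=-1)           # 3. attention weights
    return weights @ V, weights                  # 4. weighted sum of values

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 16
rng = np.random.default_rng(1)
X = rng.normal(size=(len(tokens), d_model))      # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)

# The row for "cat": how much "cat" attends to every word, including itself.
print(dict(zip(tokens, weights[tokens.index("cat")].round(3))))
```

With random weights the printed numbers are arbitrary; in a trained model, “cat” would typically place more weight on closely related words such as “sat.”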

Multi-Head Attention

Analogy: You have multiple RAs, each with a different area of expertise. They independently perform the attention process and then combine their findings to provide a comprehensive answer.

Example:

  • RA1 is an expert in neural network architectures.
  • RA2 is an expert in training methods.
  • RA3 is an expert in applications of neural networks.
  • Each RA performs the attention process on the documents independently.
  • They then combine their findings to give you a detailed and nuanced summary.

In-Depth Process:

Multiple Heads:

  • Suppose we use 3 heads. Each head has its own set of linear transformations to generate Q, K, and V.

Independent Attention:

  • Head 1 might focus on subject-verb relationships.
  • Head 2 might focus on positional relationships.
  • Head 3 might focus on noun-adjective relationships.

Combine Outputs:

  • Each head produces its own context-aware representations.
  • These representations are concatenated and passed through a final linear layer to produce the multi-head attention output (see the sketch below).
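
A rough sketch of this split-attend-concatenate pattern with 3 heads. As is common, each head here works in a smaller subspace (d_model split across the heads); all weights are random placeholders:

```python
import numpy as np
from scipy.special import softmax

def multi_head_attention(X, heads, W_o):
    """Each head attends independently; outputs are concatenated and mixed."""
    outputs = []
    for W_q, W_k, W_v in heads:                        # one set of projections per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(weights @ V)                    # this head's context-aware output
    concat = np.concatenate(outputs, axis=-1)          # concatenate the heads
    return concat @ W_o                                # final linear layer mixes them

n_tokens, d_model, n_heads = 6, 12, 3
d_head = d_model // n_heads                            # each head works in a smaller subspace
rng = np.random.default_rng(2)
X = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, heads, W_o).shape)       # (6, 12)
```

Splitting the model dimension across heads keeps the total amount of computation roughly the same as a single full-width head, while still letting each head specialize.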

Cross-Attention

Analogy: You have two groups of RAs: one group summarizes a set of input documents, and the other group generates a report based on your specific questions, focusing on the summaries provided by the first group.

Example:

  • Group 1 RAs summarize the input documents (encoder).
  • You provide specific questions to Group 2 RAs (decoder).
  • Group 2 RAs use the summaries from Group 1 to generate a report, focusing on the most relevant parts of the summaries.

In-Depth Process:

Encoder Output:

  • The encoder processes the entire input sequence and produces a set of encoded representations (one for each token).

Decoder Queries:

  • The decoder generates queries from its previous outputs (e.g., starting with the start token).

Cross-Attention Calculation:

  • The decoder queries are compared against the encoder keys (encoded representations).
  • The similarities determine the weights assigned to the encoder values.
  • The weighted sum of encoder values produces context-aware representations that the decoder uses to generate the next token (see the sketch below).
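
The mechanics are the same as self-attention, except the queries come from one sequence and the keys and values from another. A minimal sketch, with random matrices standing in for the encoder output and decoder states:

```python
import numpy as np
from scipy.special import softmax

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    """Decoder queries attend over encoder keys and values."""
    Q = decoder_states @ W_q         # queries come from the decoder
    K = encoder_output @ W_k         # keys come from the encoder output
    V = encoder_output @ W_v         # values come from the encoder output
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V, weights      # context for the decoder's next prediction

d_model = 16
rng = np.random.default_rng(3)
encoder_output = rng.normal(size=(6, d_model))   # 6 source tokens, already encoded
decoder_states = rng.normal(size=(2, d_model))   # decoder has produced 2 tokens so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(weights.shape)   # (2, 6): each decoder position weighs all 6 source tokens
```

Note that only the queries change from one decoding step to the next; the encoder keys and values can be computed once and reused for the whole output sequence.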

Detailed Example for Each Concept

Attention Mechanism:

  • Sentence: “The cat sat on the mat.”
  • Query: “sat” (looking to translate this word).
  • Keys: Contextual words (“The,” “cat,” “sat,” “on,” “the,” “mat”).
  • Values: Actual content (embeddings of these words).

Process:

  • Compute similarity between “sat” and each word.
  • Apply softmax to get attention weights.
  • Multiply these weights by the value vectors (the embeddings) to get a context-aware representation of “sat” (worked through numerically below).
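
To make the arithmetic concrete, here is the softmax step with hand-picked toy scores; the numbers are invented purely for illustration:

```python
import numpy as np

words = ["The", "cat", "sat", "on", "the", "mat"]
# Invented similarity scores between the query for "sat" and each word's key.
scores = np.array([0.1, 2.0, 1.5, 0.3, 0.1, 1.8])

# Softmax turns raw scores into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()
print(dict(zip(words, weights.round(2))))
# "cat" and "mat" get the largest weights, so their value vectors dominate
# the weighted sum that becomes the context-aware representation of "sat".
```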

Self-Attention:

  • Sentence: “The cat sat on the mat.”

Process:

  • Each word attends to all words, including itself.
  • For “cat”:
  • Query from “cat.”
  • Keys from all words.
  • Values from all words.
  • Compute similarities, apply softmax, get weighted sum, update “cat’s” representation.

Multi-Head Attention:

  • Sentence: “The cat sat on the mat.”

Process:

  • Use multiple attention heads to capture different aspects.
  • Head 1: Focuses on subject-verb.
  • Head 2: Focuses on positional.
  • Head 3: Focuses on noun-adjective.
  • Combine outputs of all heads for a richer representation.

Cross-Attention:

  • Input Sentence (English): “The cat sat on the mat.”
  • Output Sentence (French): “Le chat s’est assis sur le tapis.”

Process:

  • Encoder processes the English sentence, produces encoded representations.
  • Decoder starts generating the French sentence, uses cross-attention to focus on the encoder’s output while producing each word.
  • To generate the word after “Le” (the first decoded word):
  • The query comes from the decoder’s current state.
  • Keys come from the encoder’s output.
  • Values come from the encoder’s output.
  • Compute similarities, apply softmax, take the weighted sum, and use the resulting context to predict the next word, “chat” (a simplified decoding-loop sketch follows below).
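
Below is a deliberately simplified greedy-decoding sketch: the decoder “state” is reduced to the embedding of the last generated word, and every weight matrix is random, so the generated words are meaningless. The point is only to show where cross-attention sits inside the generation loop:

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(4)
d_model = 16
vocab = ["<start>", "Le", "chat", "s'est", "assis", "sur", "le", "tapis", "<end>"]

# Random stand-ins for what a trained translation model would provide.
encoder_output = rng.normal(size=(6, d_model))      # the encoded English tokens
embed = rng.normal(size=(len(vocab), d_model))      # toy French token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_out = rng.normal(size=(d_model, len(vocab)))      # projects context onto vocabulary scores

generated = ["<start>"]
for _ in range(4):                                   # a few greedy decoding steps
    # Query from the decoder's latest state (here: the last generated token's embedding).
    q = embed[vocab.index(generated[-1])] @ W_q
    K = encoder_output @ W_k                         # keys from the encoder's output
    V = encoder_output @ W_v                         # values from the encoder's output
    weights = softmax(q @ K.T / np.sqrt(d_model))
    context = weights @ V                            # weighted sum of encoder values
    generated.append(vocab[int(np.argmax(context @ W_out))])  # greedy pick of the next word
print(generated)  # with random weights the words are arbitrary; the loop structure is the point
```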

Summary:

  • Attention Mechanism: An RA focuses on finding relevant parts of the documents for a given query.
  • Self-Attention: Each RA considers all parts of the input simultaneously to understand context.
  • Multi-Head Attention: Multiple RAs, each with a different focus, analyze the input independently and combine their results for a richer understanding.
  • Cross-Attention: Decoder RAs use summaries provided by encoder RAs to generate a coherent and contextually relevant output.

These mechanisms together enable transformers to effectively process and generate sequences with rich, context-aware representations.

Sebastien Callebaut

Using data and coding to make better investing decisions. Co-founder of stockviz.com