Understanding Transformers (Part 2)

Attention, Self-Attention, Cross-Attention

Sebastien Callebaut
May 21, 2024

Let’s delve deeper into the attention mechanism in transformers, exploring its intricacies and how it revolutionized natural language processing.


Core Concept of Attention

At its core, the attention mechanism allows a model to focus on different parts of the input sequence when producing each part of the output sequence. This enables the model to weigh the importance of different input tokens dynamically.

Components of Attention Mechanism

Queries, Keys, and Values:

  • Queries (Q): Represent the token that is seeking information.
  • Keys (K): Represent the tokens in the sequence that can be matched against the query.
  • Values (V): Represent the actual information or content associated with each key.
  • In the context of a transformer, each input token produces a query, a key, and a value through learned linear transformations.
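As a rough sketch of that last point, the three projections can be thought of as three learned weight matrices applied to the same token embeddings. The dimensions and random weights below are purely illustrative, not taken from the article:

```python
import numpy as np

# Hypothetical sizes: 4 tokens, model dimension 8, head dimension 8
seq_len, d_model, d_k = 4, 8, 8

X = np.random.randn(seq_len, d_model)   # one embedding per input token
W_q = np.random.randn(d_model, d_k)     # learned query projection
W_k = np.random.randn(d_model, d_k)     # learned key projection
W_v = np.random.randn(d_model, d_k)     # learned value projection

Q = X @ W_q   # queries: what each token is looking for
K = X @ W_k   # keys: what each token offers for matching
V = X @ W_v   # values: the content each token carries
```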

Scaled Dot-Product Attention:

  • The attention mechanism computes a score (or weight) for each key by taking the dot product of the query with that key.
  • These scores are then scaled by the square root of the dimension of the keys to stabilize gradients.
  • The scaled scores are passed through a softmax function to obtain attention weights, which represent the importance of each key relative to the query.
  • The final output is a weighted sum of the values, using the attention weights.

Mathematically, the attention output is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

where dₖ is the dimension of the keys.
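A minimal NumPy sketch of this computation, with arbitrary random inputs standing in for the projected queries, keys, and values, could look like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # dot products, scaled by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: attention weights per query
    return weights @ V                            # weighted sum of the values

Q = np.random.randn(4, 8)   # stand-ins for the projected queries, keys, and values
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
output = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```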

Multi-Head Attention

A single attention mechanism might not be sufficient to capture different types of relationships in the data. Hence, transformers use multi-head attention, which involves:

Multiple Sets of Q, K, V:

  • The input embeddings are linearly projected into multiple sets of queries, keys, and values.
  • Each set (or head) attends to different parts of the input sequence, allowing the model to capture various aspects of the data.

Parallel Attention Mechanisms:

  • Each head performs the attention operation independently, producing different outputs.

Concatenation and Final Linear Layer:

  • The outputs from all heads are concatenated and passed through a final linear layer to produce the output of the multi-head attention block.
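Putting the three steps together, a simplified multi-head attention sketch might look like the following. The head count, dimensions, and random weights are chosen only for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads=2):
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own learned projections (random here purely for illustration)
        W_q = np.random.randn(d_model, d_head)
        W_k = np.random.randn(d_model, d_head)
        W_v = np.random.randn(d_model, d_head)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))   # independent attention per head
        head_outputs.append(weights @ V)
    concat = np.concatenate(head_outputs, axis=-1)     # concatenate all head outputs
    W_o = np.random.randn(d_model, d_model)            # final linear layer
    return concat @ W_o

X = np.random.randn(4, 8)        # 4 tokens with embedding dimension 8
out = multi_head_attention(X)    # shape (4, 8)
```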

Self-Attention vs. Cross-Attention

Self-Attention:

  • Each token in the input sequence attends to all other tokens, including itself.
  • This mechanism is used in both the encoder and decoder layers of the transformer.
  • It allows the model to consider the entire input sequence for each token, enabling context-aware processing.
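As a concrete illustration, PyTorch's nn.MultiheadAttention performs self-attention when the same tensor is passed as query, key, and value. This is only a minimal sketch with arbitrary shapes, not the article's implementation:

```python
import torch
import torch.nn as nn

self_attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 8)   # batch of 1, sequence of 4 tokens, embedding dim 8

# Self-attention: the sequence attends to itself
out, attn_weights = self_attn(x, x, x)
# out: (1, 4, 8) — a context-aware representation per token
# attn_weights: (1, 4, 4) — how much each token attends to every other token
```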

Cross-Attention:

  • Used in the decoder when generating the output sequence.
  • Here, the queries come from the decoder’s previous layer, and the keys and values come from the encoder’s output.
  • This allows the decoder to attend to the encoder’s output sequence, integrating information from the input sequence into the generation of the output.
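The same module illustrates cross-attention when the query comes from the decoder side while the keys and values come from the encoder output. Again, this is an illustrative sketch with made-up shapes rather than a full decoder:

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
decoder_state = torch.randn(1, 3, 8)    # 3 positions produced so far by the decoder
encoder_output = torch.randn(1, 5, 8)   # 5 encoded positions from the input sequence

# Cross-attention: queries from the decoder, keys and values from the encoder
out, attn_weights = cross_attn(decoder_state, encoder_output, encoder_output)
# out: (1, 3, 8) — one context vector per decoder position
# attn_weights: (1, 3, 5) — how much each decoder position attends to each input position
```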

Advantages of Attention Mechanism

Parallelization:

  • Unlike RNNs, which process tokens sequentially, transformers process the entire sequence simultaneously, allowing for significant parallelization and faster training.

Long-Range Dependencies:

  • The attention mechanism can capture dependencies between tokens irrespective of their distance in the sequence, addressing the limitations of RNNs in handling long-range dependencies.

Flexibility:

  • Attention weights are dynamically computed for each input, allowing the model to adaptively focus on relevant parts of the sequence for each token.

Why It Wasn’t Done Before

Computational Constraints:

  • The attention mechanism, especially multi-head attention, requires substantial computational resources and efficient parallel processing capabilities.
  • Advances in hardware, particularly GPUs and TPUs, made it feasible to implement and train large-scale transformer models.

Algorithmic Innovations:

  • The theoretical foundation for attention mechanisms, particularly the scaled dot-product attention and multi-head attention, provided a robust framework for dynamic weighting and parallel processing.
  • These innovations were formalized and proven effective in Vaswani et al.’s 2017 paper “Attention Is All You Need.”

Sequential Dependencies:

  • Traditional models like RNNs were inherently sequential, making it challenging to process sequences in parallel.
  • The transformer architecture, by design, eschews sequential processing in favor of parallelization, leveraging the attention mechanism to maintain contextual relationships.

Summary

The attention mechanism in transformers allows the model to dynamically focus on different parts of the input sequence, capturing complex relationships and long-range dependencies. This innovation, enabled by advances in computational power and algorithmic design, marked a significant leap in the capabilities of natural language processing models, paving the way for more accurate and efficient processing of large-scale textual data.

