LLM infini-attention with linear complexity

Kevin François
neoxia
Apr 26, 2024

Introduction

Large Language Model (LLM) architectures have been significantly shaped by the attention mechanisms at the core of Transformer-type models. These mechanisms have revolutionized how models understand and generate text by allowing them to focus on the relevant parts of input sequences. However, as models become more complex and are tasked with processing increasingly long sequences, they run into challenges such as limited attention windows and quadratic complexity. In response, new solutions such as Google’s Infini-attention have emerged, aiming to overcome these limitations while maintaining efficiency. Our goal in this article is to present the intricacies of attention mechanisms, explore the issues they face, and highlight the solutions proposed by Infini-attention.

Attention mechanism

In the realm of advanced machine learning architectures like Transformers and Transformer-based Large Language Models (LLMs), the attention mechanism plays a pivotal role in enabling contextual understanding and generation of text. At the heart of these architectures lies the ability to focus on specific parts of the input data, assigning varying degrees of importance to different elements. Self-attention facilitates the understanding of contextual relationships within a sequence through a series of intricate computations.

Figure 1 : Attention Mechanism (blue) for Next Word Prediction (red)

For instance, consider the sentence: “The cat sat on the mat.” Within this sequence, self-attention enables the model to recognize the relationships between words. By assigning attention weights based on relevance, the model understands that “cat” and “mat” are closely related, as they have high attention weights, indicating their roles as subject and object in the action “sat.”

The process unfolds through several steps:

  • Sequence: Starting from the input words, each word is tokenized and embedded into a vector of size d, and the vectors are combined into a sequence S of N tokens.
  • Key, Query, and Value Vectors: Each token in the sequence is associated with three vectors: Key (K), Query (Q), and Value (V). The Key vectors serve as comparison targets, the Query vector represents the current word, and the Value vectors hold information about all other words.
  • Dot Product Calculation: Attention scores are evaluated through a dot product between the Query vector of the current word and the Key vectors of all other words. The calculation is represented by the attention equation:
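In standard notation, with d_k the dimension of the Key vectors, this equation reads:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$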

To prevent the gradients from becoming too large, the dot products are scaled down by dividing by the square root of the dimension of the Key vectors, d_k. The scaled dot products are then passed through a softmax function, which normalizes the scores into a probability distribution. This ensures that the attention weights sum to 1, making it possible to compare the relative importance of tokens. The result is an attention matrix in which each row contains the attention scores associated with the previous tokens in the sequence of N input tokens.

Finally, the attention weights obtained from the softmax operation and stored in the attention matrix are used to compute a weighted sum of the Value vectors of all words, incorporating information from other words according to their importance scores, as shown in the image below. This local attention is recomputed for every new input sequence of tokens.

Figure 2 : Combination of all attention scores to compute the predicted output
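To make these steps concrete, here is a minimal NumPy sketch of scaled dot-product self-attention for a single head; the toy dimensions and random inputs are illustrative, not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row-wise max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: (N, d) sequence of N token embeddings of size d.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project into Query/Key/Value
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (N, N) scaled dot products
    weights = softmax(scores)                  # attention matrix, rows sum to 1
    return weights @ V                         # weighted sum of Value vectors

# Toy usage: 6 tokens ("The cat sat on the mat"), embedding size 8, head size 4.
rng = np.random.default_rng(0)
N, d, d_k = 6, 8, 4
X = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (6, 4): one context-aware vector per input token
```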

Through these intricate computations, the self-attention mechanism empowers models to focus on relevant parts of the input sequence, enhancing their understanding of contextual relationships within textual data. This mechanism has transformed the way machines comprehend and generate human-like text; however, it also introduces issues such as limited attention windows and quadratic complexity.

Attention windows

Transformers are powerful tools for natural language processing, but their reliance on positional encodings introduces limitations when dealing with long sequences. Here’s how:

Figure 3: Illustration of the positional encoding concept, which assigns a unique representation to each position within a sequence, thereby describing the location of an entity.
  • Positional Encodings: Standard Transformers struggle with long sequences because they depend on positional encodings to understand the order of words. These encodings are like unique IDs assigned to each word based on its position. For instance, the first word gets a unique “first-word” encoding, the second gets a “second-word” encoding, and so on.
  • Limited Context Window: The problem arises because these encodings are typically learned for a fixed window size, say 512 positions. If a sequence goes beyond 512 words, the model encounters positions it hasn’t seen during training. It doesn’t have specific encodings for these unseen positions, and therefore, it struggles to understand the relationships between words that are further apart.
  • Inability to Extrapolate: Imagine the model encountering the word “book” in the 1000th position. Since it hasn’t learned an encoding for the 1000th position, it can’t determine how “book” relates to surrounding words based solely on their positions.

In essence, positional encodings create an attention window. The model can effectively pay attention to words within this window and understand their relationships, but it struggles to grasp how words outside the window interact with each other. This limitation becomes a major issue when dealing with very long documents, conversations, or code where understanding long-range dependencies is crucial.
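As a toy illustration of this window effect (the table size and dimensions here are made up, not taken from any specific model), a learned positional-embedding table covering 512 positions simply has no entry for position 1000:

```python
import numpy as np

# Hypothetical learned positional-embedding table: 512 positions, 64 dimensions.
max_positions, d_model = 512, 64
position_table = np.random.randn(max_positions, d_model)

print(position_table[10].shape)       # (64,) -> positions 0..511 work fine
try:
    _ = position_table[1000]          # position beyond the training window
except IndexError as err:
    print(err)                        # index 1000 is out of bounds for axis 0 with size 512
```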

Quadratic complexity

The second issue is the evaluation of the attention matrix, which involves learning mappings from each token to every other token in the sequence. This cost arises because every input element needs to interact with every other element, leading to a computational complexity of O(n²), where n is the length of the sequence. This quadratic cost is a major challenge, especially with longer sequences, and has been a focus of recent research. For example, the attention Key-Value (KV) states have a 3 TB memory footprint for a 500B-parameter model with a batch size of 512 and a context length of 2048 [1].
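As a rough back-of-the-envelope illustration of this growth (single head, single batch element, fp16 scores; the sequence lengths below are arbitrary):

```python
# Memory needed just for the N x N attention-score matrix, stored in fp16.
bytes_per_score = 2
for n in (2_048, 32_768, 1_000_000):
    gib = n * n * bytes_per_score / 2**30
    print(f"n = {n:>9,}: attention matrix ~ {gib:,.2f} GiB")
```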

To address the challenges posed by this quadratic complexity, particularly as token lengths grow larger, strategies for scaling up model sizes become essential. However, scaling up introduces tradeoffs in terms of inference cost, which are measured through various metrics such as latency, throughput, and model FLOPS utilization.

Latency

Latency, in the context of model inference, refers to the total time required to complete an inference task. It consists of two main components:

  1. Prefill Latency: This component encompasses the time taken to process the input tokens at the beginning of the inference process. Prefill latency is crucial as it sets the initial conditions for subsequent decoding steps.
  2. Decode Latency: Decode latency refers to the time taken to autoregressively generate output tokens based on the processed input tokens. It involves predicting each token sequentially, which can be a computationally intensive task, especially for long sequences.

Throughput

Throughput is a metric that quantifies the rate at which tokens are processed or generated during inference. It provides insights into the efficiency of model inference and is typically measured in tokens per second. Higher throughput indicates faster model inference, which is desirable for real-time applications or large-scale deployments.

Model FLOPS

Model FLOPS (Floating Point Operations per Second) utilization measures the efficiency of hardware usage during inference. It compares the observed throughput (the actual number of tokens processed or generated per second) with the theoretical maximum throughput that could be achieved if the hardware were operating at peak FLOPS with no memory or communication overhead. When discussing model FLOPS utilization, it is essential to consider the computational intensity of the operations performed during inference, including matrix multiplications (matmuls) and attention calculations. These operations consume computational resources and contribute to the overall inference time.
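As a sketch of how this metric is computed (every number below is an assumption for illustration, and the "2 × parameters FLOPs per generated token" rule is itself only a rough approximation of a decoder forward pass):

```python
n_params = 7e9                        # assumed model size: 7B parameters
flops_per_token = 2 * n_params        # rough FLOPs for one forward pass per token
observed_tokens_per_s = 2_000         # assumed measured decoding throughput
peak_flops = 300e12                   # assumed hardware peak: 300 TFLOP/s

mfu = observed_tokens_per_s * flops_per_token / peak_flops
print(f"Model FLOPS utilization: {mfu:.1%}")   # ~9.3% with these illustrative numbers
```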

In conclusion, while scaling Transformer-based large language models (LLMs) can enhance performance, it incurs a computational and memory burden. Furthermore, these models exhibit limitations in handling long sequences due to inherent constraints in positional encoding. This restricts their ability to capture long-range dependencies, hindering their effectiveness in tasks requiring comprehension of extensive contextual information. The Google paper, “Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention,” proposes the Infini-Attention mechanism to address these shortcomings. This novel approach empowers Transformers to efficiently process sequences of any length without compromising performance, paving the way for advancements in tasks that necessitate analysis of vast amounts of textual data.

Infini-attention

Basic concept

Central to the Google paper’s approach is the development of a new attention mechanism called Infini-attention, which integrates compressive memory into the standard attention mechanism of Transformers. This concept builds on two earlier publications:

  • Transformer-XL: Attentive language models beyond a fixed-length context: introduces an extension to the standard Transformer architecture. This extension, called Transformer-XL, allows the model to effectively capture long-range dependencies in sequences. Unlike standard Transformers, Transformer-XL is not limited by a fixed-length context window. It achieves this by employing a recurrence mechanism. This mechanism essentially builds up the model’s memory from previous segments, enabling it to maintain a broader context size for processing the current segment.
  • Transformers are RNNs: proposes a novel linear attention mechanism that offers a significant computational advantage compared to the standard dot-product attention used in Transformers. The standard attention mechanism has a quadratic time and space complexity. In contrast, the proposed linear attention mechanism operates in linear time and space, making it much more efficient.
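Concretely, the linear-attention idea replaces the softmax similarity with a kernel feature map φ (for example ELU + 1), so the output for position i can be rewritten as

$$o_i = \frac{\phi(q_i)^{\top} \sum_{j \le i} \phi(k_j)\, v_j^{\top}}{\phi(q_i)^{\top} \sum_{j \le i} \phi(k_j)},$$

where the two sums can be carried along as a running state and updated one token (or one segment) at a time, much like an RNN hidden state. This recursive form is what Infini-attention’s compressive memory builds on.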

The Google paper proposes splitting the long sequence into smaller segments (S), each with a fixed number of tokens (N). The model processes these segments one at a time.

  1. Memory State: A hidden state carries information about previously processed segments. This state is updated as the model moves through the sequence. You can think of it as a memory that captures what the model has learned so far.
  2. Local Attention: Within each segment, the model performs “local attention”. This attention mechanism only considers the tokens within the current segment (not the entire sequence).
  3. Combining Information: Importantly, the local attention mechanism also incorporates the hidden state. This hidden state, which is updated based on previous segments, injects information about the broader context beyond the current segment.

To sum up, this approach avoids the O(n²) complexity of full attention over the entire sequence. By combining local attention with the informative hidden state, the model can still capture long-range dependencies in the sequence despite processing segments individually.

Figure 4 : Infini-Transformer (top) has an entire context history whereas Transformer-XL (bottom) discards old contexts since it caches the KV states for the last segment only [1]

Technical Implementation:

We can view Infini-attention as a clever combination of ideas from these earlier papers, integrating long-term attention while remaining memory-efficient. Here, we focus on a high-level understanding of each step and how they work together, rather than diving into the equations:

  1. Local Attention: Similar to standard Transformers, each segment of N input tokens undergoes self-attention, as explained previously.
  2. Memory retrieval & update: This step holds the key innovation. The objective is to break the computationally expensive softmax attention down into a form resembling a recursive function. The challenge lies in the nature of softmax: it normalizes over the whole sequence and resists decomposition, which is why standard Transformers must store all key-value states for the final softmax calculation, leading to inefficiency. The solution borrows an idea from the “Transformers are RNNs” paper: approximate the softmax with a linearizable form that allows recursive computation, similar to how RNNs operate. This enables the model to iteratively update the overall attention memory stored within the hidden state and pass it along throughout the sequence.
  3. Delta Rule: The delta rule improves the memory update process by first retrieving the existing value entries from the memory and subtracting them from the new values before applying the associative bindings as the update. The aim is to leave the associative matrix (which stores the key-value bindings) unmodified if a new key-value binding already exists in the memory.
  4. Gating and Combining Information: Finally, a gating function combines the content retrieved from the compressive memory with the local attention output computed on the new segment. This balances the influence of long-term context and current-segment information in the final attention output (see the code sketch of steps 2–4 after this list).
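Below is a minimal single-head NumPy sketch of steps 2–4: reading from the compressive memory, the optional delta-rule write, and the gated combination with local attention. It follows the update rules described in the paper as we understand them (ELU + 1 as the feature map, a scalar beta standing in for the learned gating parameter); the dimensions and the toy loop at the end are illustrative assumptions, not the authors’ code.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1: a non-negative feature map for the linearised attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def retrieve(M, z, Q, eps=1e-6):
    # Read from the compressive memory: A_mem = sigma(Q) M / (sigma(Q) z).
    sQ = elu_plus_one(Q)                        # (N, d_k)
    return (sQ @ M) / (sQ @ z[:, None] + eps)   # (N, d_v)

def update(M, z, K, V, use_delta_rule=True, eps=1e-6):
    # Write the current segment's key-value bindings into the memory.
    sK = elu_plus_one(K)                        # (N, d_k)
    if use_delta_rule:
        # Delta rule: subtract what the memory already returns for these keys,
        # so bindings that are already stored are not double-counted.
        retrieved = (sK @ M) / (sK @ z[:, None] + eps)
        M = M + sK.T @ (V - retrieved)
    else:
        M = M + sK.T @ V
    z = z + sK.sum(axis=0)
    return M, z

def local_attention(Q, K, V):
    # Ordinary scaled dot-product attention restricted to the current segment
    # (causal masking omitted for brevity).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def infini_attention_segment(M, z, Q, K, V, beta):
    # One step per segment: read long-term memory, compute local attention,
    # gate the two, then update the memory with the current segment.
    A_mem = retrieve(M, z, Q)
    A_dot = local_attention(Q, K, V)
    gate = 1.0 / (1.0 + np.exp(-beta))          # sigmoid of the gating scalar
    A = gate * A_mem + (1.0 - gate) * A_dot
    M, z = update(M, z, K, V)
    return A, M, z

# Toy usage: d_k = d_v = 4, segments of N = 3 tokens, memory starts empty.
rng = np.random.default_rng(0)
d_k, d_v, N = 4, 4, 3
M, z = np.zeros((d_k, d_v)), np.zeros(d_k)
for _ in range(2):                              # two consecutive segments
    Q, K, V = (rng.normal(size=(N, d_k)) for _ in range(3))
    A, M, z = infini_attention_segment(M, z, Q, K, V, beta=0.0)
print(A.shape)  # (3, 4): gated attention output for the latest segment
```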

This approach allows Infini-Attention to efficiently access historical information while maintaining bounded memory usage, enabling Transformers to effectively process infinitely long sequences.

Experimental Validation:

To gauge the effectiveness of Infini-attention, the authors conducted various experiments involving long-context tasks. Here, we’ll focus on one specific experiment (refer to the paper for a comprehensive evaluation). They compared Infini-attention with other models designed for long context and memory efficiency, such as Transformer-XL and the Recurrent Memory Transformer (RMT).

The experiment involved training and evaluating small Infini-Transformer models on two long-context language modeling benchmarks: PG19 and Arxiv-math. These Infini-Transformer models had a specific configuration:

  • 12 layers
  • 8 attention heads, each with a dimension of 128
  • Feedforward networks with a hidden layer size of 4096

For Infini-Attention, the segment length within each layer was set to 2048 tokens. During training, the input sequence length was set to 32768 tokens. This allowed the Infini-Attention model to process the sequence in 16 steps (unrolls) with respect to its compressed memory states.

It’s important to note that the RMT baseline was evaluated with different configurations to find the optimal settings. The best performing RMT setup used 100 summary vectors trained on sequences with a length of 8196 tokens.

Table 1 : Long-context language modeling results compared in terms of average token-level perplexity. Comp. denotes compression ratio. Infini-Transformer outperforms Memorizing Transformers with a memory length of 65K and achieves a 114x compression ratio.

To evaluate the performance, the paper employed perplexity as the metric. Perplexity essentially reflects how well a model predicts unseen data. Lower perplexity scores indicate better performance, meaning the model is less surprised by new information and has higher predictive accuracy.

The key findings are summarized in the table above. Notably, Infini-Transformer outperformed both Transformer-XL and Memorizing Transformers on these long-context modeling tasks. Importantly, it achieved this performance while maintaining a 114x smaller memory footprint compared to the Memorizing Transformer model.

Furthermore, experiments showed that increasing the training sequence length from 32,000 tokens to 100,000 tokens led to further improvements in perplexity scores for Infini-Transformer models. This suggests that Infini-Transformer can effectively handle even longer sequences while maintaining efficiency.

Conclusion

Large language models (LLMs) have been revolutionized by the introduction of attention mechanisms within Transformer architectures. These mechanisms have fundamentally changed how models understand and generate text by enabling them to focus on crucial parts of input sequences. However, as models become more complex and handle increasingly longer sequences, challenges like attention windows and quadratic complexity emerge.

In response to these issues, Google’s Infini-Attention mechanism offers a groundbreaking solution. By integrating a compressed memory module into the standard Transformer attention mechanism, Infini-Attention tackles these limitations while maintaining efficiency. This paves the way for significant advancements in natural language processing, allowing models to handle vast amounts of textual data with minimal computational resources.

Key Takeaways:

  • Traditional Transformers struggle with long sequences due to limitations in positional encoding, hindering their ability to capture long-range dependencies.
  • Infini-Attention addresses these limitations by incorporating a compressed memory module that stores historical information efficiently.
  • This approach overcomes the quadratic complexity bottleneck associated with standard attention mechanisms, enabling efficient processing of infinitely long sequences.
  • Experiments demonstrate that Infini-Transformer models outperform previous solutions like Transformer-XL and Recurrent Memory Transformers on long-context language modeling tasks, achieving this with a significantly smaller memory footprint.

Future Implications:

The development of Infini-Attention opens doors for advancements in various NLP applications that require analysis of extensive textual data. These applications include:

  • Document summarization: Infini-Attention models can effectively capture long-range dependencies within documents, leading to more comprehensive and informative summaries.
  • Question answering over factual databases: By efficiently processing vast amounts of factual information, Infini-Attention models can provide more accurate and insightful answers to complex questions.
  • Machine translation: The ability to handle long sequences allows Infini-Attention models to capture context more effectively, leading to improved translation quality, especially for languages with complex syntactic structures.

Overall, Infini-Attention could present a significant advancement in the field of LLMs. Its ability to efficiently handle infinitely long sequences could lead to the development of more powerful and versatile language models that can unlock new possibilities in natural language processing.

Resources

  1. MUNKHDALAI, Tsendsuren, FARUQUI, Manaal, and GOPAL, Siddharth. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. arXiv preprint arXiv:2404.07143, 2024.
  2. https://machinelearningmastery.com/the-attention-mechanism-from-scratch/
  3. https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/
  4. KELES, Feyza Duman, WIJEWARDENA, Pruthuvi Mahesakya, and HEGDE, Chinmay. On the computational complexity of self-attention. In: International Conference on Algorithmic Learning Theory. PMLR, 2023. p. 597–619.
  5. DAI, Zihang, YANG, Zhilin, YANG, Yiming, et al. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  6. KATHAROPOULOS, Angelos, VYAS, Apoorv, PAPPAS, Nikolaos, et al. Transformers are RNNs: Fast autoregressive transformers with linear attention. In: International Conference on Machine Learning. PMLR, 2020. p. 5156–5165.
  7. CHEVALIER, Alexis, WETTIG, Alexander, AJITH, Anirudh, et al. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788, 2023.
