Coffee Time Papers: Leave No Context Behind

Efficient Infinite Context Transformers with Infini-attention

Dagang Wei
6 min read · May 31, 2024

This blog post is part of the series Coffee Time Papers.

Paper

Overview

This paper introduces Infini-attention, a new attention mechanism for Transformer-based Large Language Models (LLMs) that allows them to process infinitely long inputs with limited memory and computation. The key component of Infini-attention is a compressive memory that stores past key-value states, enabling the model to access and utilize information from much earlier in the input sequence. This is in contrast to traditional attention mechanisms, which have a fixed context window and can only access a limited amount of past information.

The Infini-attention mechanism combines local attention within a segment of the input sequence with long-term linear attention over the compressed memory. This allows the model to capture both short-term dependencies within the current segment and long-term dependencies across the entire input history. The authors demonstrate the effectiveness of their approach on various tasks, including long-context language modeling, passkey retrieval, and book summarization. Their results show that Infini-attention outperforms baseline models on these tasks while using significantly less memory.

In summary, this paper presents a novel approach to scaling Transformer-based LLMs to infinitely long inputs. The proposed Infini-attention mechanism offers a promising solution to the memory and computational constraints of traditional attention mechanisms, enabling LLMs to better leverage long-range dependencies in various natural language processing tasks.

Q & A

What is the main contribution of this paper?

This paper introduces Infini-attention, a novel attention mechanism for Transformer-based Large Language Models (LLMs) that enables them to process infinitely long inputs with bounded memory and computation. This is a significant advancement as traditional Transformer architectures struggle with long sequences due to the quadratic complexity of their attention mechanism.

How does Infini-attention work?

Analogy:

Imagine a student trying to write an essay on a complex topic. They have a vast library of books (representing the entire input sequence) but can only hold a few books at a time (representing the limited context window of traditional attention). They start by reading a few books, taking notes, and forming an initial understanding. However, as they continue reading, they forget details from the earlier books.

Infini-attention is like giving the student a special notebook (the compressive memory). As they read each book, they summarize the key points in the notebook. When they need to recall information from earlier books, they don’t have to go back and reread them; they can simply refer to their notes in the notebook. This allows them to maintain a much broader understanding of the topic without having to hold all the books simultaneously.

In this analogy:

  • The books represent the input sequence of text.
  • The student’s limited capacity represents the fixed context window of traditional attention.
  • The notebook represents the compressive memory.
  • The notes represent the compressed key-value states.
  • The act of summarizing represents the memory update process.
  • The act of referring to notes represents the memory retrieval process.

This analogy illustrates how Infini-attention allows language models to maintain a broader understanding of the input text by compressing and storing past information, enabling them to access and utilize it when needed, even if it’s far back in the sequence.

Technical details:

Infini-attention incorporates a compressive memory into the standard attention mechanism. It combines masked local attention within a segment of the input and long-term linear attention over the compressed memory. This allows the model to capture both short-term dependencies within the current segment and long-term dependencies across the entire input history. The compressive memory stores past key-value states, enabling access to information from much earlier in the input sequence, unlike traditional attention with a fixed context window.

Compressive memory in Infini-attention reuses the query (Q), key (K), and value (V) states from the standard dot-product attention computation. It stores bindings of key and value states and retrieves them using query vectors.

The memory is parameterized as an associative matrix (M_s) and a normalization term (z_s). The retrieval process uses the query (Q) and the previous memory state (M_{s−1}) to retrieve new content (A_mem):

A_mem = σ(Q) M_{s−1} / (σ(Q) z_{s−1})

where σ is a non-linear activation function (element-wise ELU + 1) and z_{s−1} is the normalization term (the sum of σ over all keys seen so far).
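
To make the retrieval step concrete, here is a minimal NumPy sketch (my own illustration, not the authors' implementation). The shapes are assumptions: Q holds the queries of one N-token segment with key dimension d_k, M_prev is the (d_k, d_v) associative matrix M_{s−1}, and z_prev is the (d_k,) normalization term z_{s−1}. A tiny epsilon is an added assumption, only there so a read from an empty memory stays finite.

    import numpy as np

    def elu_plus_one(x):
        # sigma(x) = ELU(x) + 1, applied element-wise; keeps activations positive
        return np.where(x > 0, x + 1.0, np.exp(x))

    def retrieve(Q, M_prev, z_prev, eps=1e-8):
        # Q: (N, d_k), M_prev: (d_k, d_v), z_prev: (d_k,)
        sQ = elu_plus_one(Q)
        return (sQ @ M_prev) / ((sQ @ z_prev)[:, None] + eps)  # A_mem: (N, d_v)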

After retrieval, the memory is updated with the new key (K) and value (V) entries:

M_s ← M_{s−1} + σ(K)^T V
z_s ← z_{s−1} + Σ_{t=1}^{N} σ(K_t)

where ^T denotes the transpose operation and N is the number of tokens in the current segment.
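
Continuing the sketch above (same assumed shapes, reusing elu_plus_one), the plain linear update simply accumulates the segment's new bindings:

    def update(M_prev, z_prev, K, V):
        # K: (N, d_k), V: (N, d_v)
        sK = elu_plus_one(K)
        M_new = M_prev + sK.T @ V          # add the segment's key-value bindings
        z_new = z_prev + sK.sum(axis=0)    # accumulate the normalization term
        return M_new, z_new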

The authors also incorporate the delta rule, which improves memory update by subtracting existing value entries before applying new updates:

M_s ← M_{s−1} + σ(K)^T (V − σ(K) M_{s−1} / (σ(K) z_{s−1}))

This update rule leaves the associative matrix unchanged if the key-value binding already exists in the memory.
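
Here is a sketch of the delta-rule variant in the same style (again my own illustration, with the same assumed shapes and epsilon): the memory is first read with the new keys, and only the residual value is written back, so a binding that is already stored adds roughly nothing.

    def update_delta(M_prev, z_prev, K, V, eps=1e-8):
        sK = elu_plus_one(K)
        # What the memory currently associates with these keys
        retrieved = (sK @ M_prev) / ((sK @ z_prev)[:, None] + eps)
        M_new = M_prev + sK.T @ (V - retrieved)  # write only the new part
        z_new = z_prev + sK.sum(axis=0)
        return M_new, z_new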

The retrieved content (A_mem) is then combined with the local attention context (A_dot) using a learned gating scalar (β):

A = sigmoid(β) ⊙ A_mem + (1 − sigmoid(β)) ⊙ A_dot

This allows for a learnable trade-off between long-term and local information.
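
A minimal sketch of the gate, assuming β is a single learned scalar (the paper learns one gate per attention head):

    def combine(A_mem, A_dot, beta):
        gate = 1.0 / (1.0 + np.exp(-beta))          # sigmoid(beta)
        return gate * A_mem + (1.0 - gate) * A_dot  # mix long-term and local context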

Overall, compressive memory in Infini-attention provides an efficient way to store and retrieve information from the entire input history, enabling the model to capture long-range dependencies while maintaining a bounded memory footprint.
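
Putting the pieces together, a hypothetical segment-by-segment loop might look like the sketch below. Random tensors stand in for the projected queries, keys, and values, and a zero tensor stands in for the local attention output (which is just standard masked dot-product attention, not shown). The point is that M and z keep a fixed size no matter how many segments are processed.

    rng = np.random.default_rng(0)
    d_k, d_v, N, num_segments = 64, 64, 128, 4
    M, z = np.zeros((d_k, d_v)), np.zeros(d_k)   # bounded memory footprint
    beta = 0.0                                   # would be learned in practice
    for _ in range(num_segments):
        # Stand-ins for one segment's projected Q, K, V
        Q, K, V = (rng.standard_normal((N, d)) for d in (d_k, d_k, d_v))
        A_dot = np.zeros((N, d_v))               # placeholder for local attention
        A_mem = retrieve(Q, M, z)                # read long-term context
        A = combine(A_mem, A_dot, beta)          # gated mix of both
        M, z = update_delta(M, z, K, V)          # write the segment into memory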

What are the advantages of Infini-attention over traditional attention mechanisms?

  • Infinite Context: Infini-attention allows LLMs to process infinitely long inputs, overcoming the fixed context window limitation of traditional attention mechanisms.
  • Bounded Memory and Computation: It achieves this with bounded memory and computation resources, making it more efficient than previous methods for long-context modeling.
  • Improved Performance: Infini-attention outperforms baseline models on various tasks, including long-context language modeling, passkey retrieval, and book summarization.
  • Length Generalization: It demonstrates promising length generalization capabilities, with models fine-tuned on shorter sequences successfully handling much longer inputs.

What tasks were used to evaluate Infini-attention?

The authors evaluated Infini-attention on three main tasks:

  • Long-context language modeling: Predicting the next token in a sequence, given a long context.
  • Passkey retrieval: Finding a hidden number within a long text.
  • Book summarization: Generating a summary of an entire book.

What were the results of the evaluation?

Infini-attention outperformed baseline models on all evaluated tasks while using significantly less memory. In long-context language modeling, it achieved better perplexity scores than Transformer-XL and Memorizing Transformers. In passkey retrieval, a 1B model fine-tuned on 5K length sequences solved the task for 1M length inputs. In book summarization, an 8B model with Infini-attention achieved state-of-the-art results on the BookSum dataset.

What are the potential implications of this work?

This work has the potential to significantly improve the capabilities of LLMs by allowing them to process much longer inputs. This could lead to advancements in various natural language processing applications, such as:

  • Document understanding: Better comprehension of long documents like research papers or legal texts.
  • Summarization: Generating more accurate and informative summaries of lengthy texts.
  • Question answering: Answering complex questions that require understanding of a large context.
  • Continual learning: Adapting to new information and knowledge more effectively.

What are the limitations of this work?

  • Focus on Technical Aspects: The paper primarily focuses on the technical details and evaluation of Infini-attention, with less discussion of its broader implications and potential applications.
  • Computational Complexity: While Infini-attention addresses memory constraints, it may introduce other challenges, such as increased computational complexity, which need further investigation.
  • Real-world Applications: More research is needed to explore the effectiveness and efficiency of Infini-attention in real-world scenarios and on a wider range of tasks.
