Infini-Attention: Infinite Context for LLMs

Vishal Rajput
Published in AIGuys
10 min read · Apr 29, 2024

Increasing the context window of LLMs has been a long struggle. Over the years we have invented a lot of techniques, but progress has been slow and tedious. To work around the memory problem, we even engineered RAG pipelines that act as a kind of semi-context window for LLMs. The context window is like short-term memory for an LLM: the bigger the window, the more context we can fit into it, and the better and more nuanced the answers can be. So, in today’s blog, we are going to look at how Google DeepMind built an infinite context window for LLMs.

Table of Contents

  • Understanding Attention Memory Requirements
  • Flash Attention 2.0
  • Methods with Attention Approximation
  • Memory with Associative Bindings
  • Infinite Context Window
  • Conclusion
Photo by Angely Acevedo on Unsplash

Understanding Attention Memory Requirements

The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings two fundamental drawbacks: it cannot model anything outside of its finite window, and its cost scales quadratically with the window length.
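To make that quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (the function and variable names are my own, not from any paper or library). The score matrix it materializes has shape seq_len × seq_len, so doubling the context length quadruples the memory needed just to hold the attention scores:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Q, K, V: arrays of shape (seq_len, d_model).
    The score matrix has shape (seq_len, seq_len), so its memory
    grows quadratically with the context length.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                         # (seq_len, d_model)

# Rough memory for the fp32 score matrix of one head:
for seq_len in (1_024, 2_048, 4_096):
    bytes_needed = seq_len * seq_len * 4
    print(f"{seq_len:>5} tokens -> {bytes_needed / 1e6:.0f} MB of attention scores")
```

Doubling from 2,048 to 4,096 tokens goes from roughly 17 MB to 67 MB of scores per head per layer, which is why naive attention becomes impractical long before we reach million-token contexts.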
