Infini-Attention: Infinite Context for LLMs
Increasing the context window of LLMs has been a long struggle. Over the years we have invented many techniques, but progress has been slow and tedious. To work around the memory problem, we even engineered RAG pipelines, which act as a kind of external, semi-context window for LLMs. The context window is like short-term memory for an LLM: the bigger the window, the more context we can fit into it, and the better and more nuanced the answer can be. So, in today's blog, we are going to look at how Google invented an infinite context window for LLMs.
Table of Contents
- Understanding Attention Memory Requirements
- Flash Attention 2.0
- Methods with Attention Approximation
- Memory with Associative Bindings
- Infinite Context Window
- Conclusion
Understanding Attention Memory Requirements
The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length.
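To make the quadratic scaling concrete, here is a minimal back-of-the-envelope sketch (the helper name and byte size are illustrative assumptions, not from the paper): the attention score matrix QKᵀ for a single head has shape (window length × window length), so its memory grows with the square of the window.

```python
def score_matrix_bytes(seq_len: int, bytes_per_elem: int = 4) -> int:
    """Memory for one head's attention score matrix Q @ K^T.

    The matrix has shape (seq_len, seq_len), so memory is
    quadratic in the window length (4 bytes assumes float32).
    """
    return seq_len * seq_len * bytes_per_elem

# Doubling the window quadruples the score-matrix memory:
for n in (1_024, 8_192, 65_536):
    print(f"window {n:>6}: {score_matrix_bytes(n) / 2**20:12.1f} MiB per head")
```

Even before multiplying by the number of heads and layers, a 65k-token window already needs on the order of 16 GiB just for one head's score matrix, which is why naive self-attention cannot simply be scaled to longer contexts.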