Infini-Attention: Infinite Context for LLMs
Increasing the context window of LLMs has been a long struggle. Over the years we have invented many techniques, but progress has been slow and tedious. To work around the memory problem, we even engineered RAG pipelines, which act as a kind of external, semi-context window for LLMs. The context window is like short-term memory for an LLM: the bigger the window, the more context we can fit into it, and the better and more nuanced the answer can be. So, in today's blog, we are going to look at how Google invented an infinite context window for LLMs.
Table of Contents
- Understanding Attention Memory Requirements
- Flash Attention 2.0
- Methods with Attention Approximation
- Memory with Associative Bindings
- Infinite Context Window
- Conclusion
Understanding Attention Memory Requirements
The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length.
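To make the quadratic scaling concrete, here is a minimal back-of-the-envelope sketch (the helper name and byte size are illustrative assumptions, not from the paper): the attention score matrix QKᵀ for a single head has shape (window length × window length), so its memory grows with the square of the window.

```python
def score_matrix_bytes(seq_len: int, bytes_per_elem: int = 4) -> int:
    """Memory for one head's attention score matrix Q @ K^T.

    The matrix has shape (seq_len, seq_len), so memory is
    quadratic in the window length (4 bytes assumes float32).
    """
    return seq_len * seq_len * bytes_per_elem

# Doubling the window quadruples the score-matrix memory:
for n in (1_024, 8_192, 65_536):
    print(f"window {n:>6}: {score_matrix_bytes(n) / 2**20:12.1f} MiB per head")
```

Even before multiplying by the number of heads and layers, a 65k-token window already needs on the order of 16 GiB just for one head's score matrix, which is why naive self-attention cannot simply be scaled to longer contexts.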