Context Rot: How Increasing Input Tokens Impacts LLM Performance
Large Language Models (LLMs) are typically presumed to process context uniformly — that is, the model should handle the 10,000th token just as reliably as the 100th. However, in practice, this assumption does not hold. We observe that model performance varies significantly as input length changes, even on simple tasks.
In this blog, we will evaluate several LLMs, including the state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models. We will see that models do not use their context uniformly; instead, their performance becomes increasingly unreliable as input length grows.
Table Of Contents
- How Do Transformers Work?
- Problems With Benchmarking
- Needle in a Haystack Extension
- Haystack Structure
- Other Experiments
- Conclusion
How Do Transformers Work?
Transformers have dominated the AI landscape since their introduction in the 2017 paper “Attention Is All You Need”. There are many reasons why they work so well, but in simple terms, it comes down to their ability to route information in a weighted manner.
- Through content-based, dynamic…
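To make the weighted-routing idea concrete, here is a minimal, single-query sketch of the scaled dot-product attention mechanism from the original paper, in plain Python. The toy query, key, and value vectors are invented purely for illustration:

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Single-query scaled dot-product attention.

    The query is compared against every key; the softmax turns those
    similarity scores into weights that mix the value vectors --
    this is the content-based, weighted routing described above.
    """
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return output, weights

# Toy example: one query attending over three tokens.
out, w = attention(
    [1.0, 0.0],                               # query
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],     # keys
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],     # values
)
print(len(out))          # output has the same dimension as a value vector
print(round(sum(w), 6))  # attention weights always sum to 1
```

Note that every query attends over every key, which is exactly why context length matters: the softmax spreads a fixed budget of attention weight across all tokens, so longer inputs dilute how sharply the model can focus on any one of them.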

