Why I’m not worried about LLMs’ long-context problem.
Short-term pain that will eventually be solved.
Someone made this comment on X, “LLMs are still incredibly bad at long context, severe drop in response quality from the best of the best models (o1, Claude, grok, DeepSeek), doesn’t really matter what model — it will choke.”
Initially I related to this, because I’ve seen how ChatGPT (the good, the bad, and the ugly) handles long texts, or even multiple texts at once. ChatGPT often struggles to infer, connect the dots, or synthesize the information when I give it a specific task.
Then I saw a post with a graph from a study, which I’m certain the initial comment was referring to, of model performance vs. context length. In short, the research shows that LLMs struggle with long context, which matches my experience.
However, I’m not concerned, because I’ve seen improvements and it will only get better. A lot also depends on how the user works with the AI: mastering prompt engineering, chain-of-thought prompting, and similar techniques is essential for getting around the issue. If one doesn’t want that kind of hassle, or doesn’t want to hunt for loopholes and workarounds for this flaw, I can see why long context is still an issue worth tackling, and why solving it could improve the overall output of these models.
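To make that concrete, here is the kind of structured, chain-of-thought-style prompt I have in mind for long documents. This is a rough sketch; the section markers and wording are purely illustrative, not a guaranteed fix.

```python
# Illustrative only: a structured prompt that asks the model to work through a long
# document step by step before answering, rather than answering in one shot.
prompt_template = """You will read a long document split into numbered sections.

{numbered_sections}

Task: {question}

Before answering:
1. List the sections that are relevant to the task.
2. Quote the key sentence from each relevant section.
3. Only then write your final answer, citing the section numbers you used.
"""
```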
Me: Provide thoughts and feedback to this claim made: LLMs are still incredibly bad at long context, severe drop in response quality from the best of the best models (o1, Claude, grok, DeepSeek), doesn’t really matter what model — it will choke
ChatGPT: This claim — “LLMs are still incredibly bad at long context, with severe drop in response quality from even the best models (o1, Claude, Grok, DeepSeek); it doesn’t really matter what model — it will choke” — is worth dissecting carefully. While there’s truth to the idea that LLMs struggle with long-context retention and response quality degradation, there are nuances and counterpoints that make the situation more complex than an outright condemnation. Let’s break it down.
1. What Does “Long Context” Actually Mean?
- The phrase “long context” is ambiguous. Does it mean:
- Token limit approach? (e.g., the model’s performance degrades when near its token cap)
- Information retrieval over long sequences? (e.g., difficulty referencing details from the beginning of a conversation)
- Logical coherence over long discourse? (e.g., maintaining structured, high-quality responses without hallucination)
- Sustaining consistent reasoning across multiple messages? (e.g., carrying insights over an extended conversation)
- LLMs today (especially Claude 3 Opus, GPT-4-turbo, and DeepSeek) have different failure modes depending on what aspect of “long context” you’re testing.
- If we define “long context” purely in terms of token windows, then models like Claude 3 Opus and Gemini 1.5 have significantly improved (with reported capabilities of 200k+ tokens). However, retrieval within those windows isn’t perfect.
2. The “Severe Drop in Response Quality” Issue
- It’s true that models can degrade in quality as the context window extends, but not always in a catastrophic way.
- Some primary reasons for this degradation:
- Recency Bias / Positional Attention: Most LLMs prioritize more recent tokens, which can lead them to forget earlier details in favor of later ones.
- Compression Artifacts: Some models use heuristics to “compress” older information, leading to summarization errors or oversimplifications.
- Loss of Deep Understanding: Even if the model technically retains all words, it may fail to integrate them meaningfully into responses. It might regurgitate without deep reasoning.
- However, “severe drop” is conditional:
- If the input is structured well (clear sectioning, bullet points, proper formatting), models retain context better.
- If retrieval is tested explicitly (e.g., “What was the key point in paragraph three?”), some models struggle, but Claude 3 Opus and Gemini 1.5 have shown major improvements.
- Not all models choke equally:
- GPT-4-turbo tends to stay logically consistent but may sometimes “summarize too aggressively.”
- Claude 3 Opus appears better at recalling precise details but has its own quirks.
- Grok-1, DeepSeek, and Gemini 1.5 vary in their ability to retrieve vs. reason over long contexts.
3. “Doesn’t Really Matter What Model — It Will Choke”
- The “it doesn’t matter what model” claim is too strong.
- Models have different failure points under long-context conditions.
- Some maintain coherence better but lose granularity (e.g., GPT-4-turbo).
- Some remember specifics better but struggle with reasoning (e.g., DeepSeek).
- Some handle retrieval explicitly better (e.g., Claude 3 Opus).
- It also depends on the kind of long-context use case:
- Reading comprehension? (Claude and Gemini handle this well.)
- Maintaining narrative over long passages? (GPT-4 is generally strong.)
- Pulling specific details verbatim? (DeepSeek can do better in some cases.)
- Sustaining coherent multi-turn discourse over hours? (Most models degrade, but Opus does better than others.)
- So, saying “every model chokes equally” is incorrect — they fail in different ways.
4. Long Context is Improving — But Not Perfect
- Current approaches to long-context retention:
- Extended Context Windows (Claude 3, Gemini 1.5) — bigger token windows help, but they don’t solve deeper retrieval/reasoning failures.
- Memory-like Approaches (Anthropic’s experiments, GPT’s memory features) — promising, but early-stage.
- Sparse Attention Techniques (DeepMind’s work, new architectures) — might fix some degradation issues.
- There is progress, but LLMs still struggle when the context is poorly structured or retrieval is deeply implicit.
Final Thought: The Claim is Oversimplified
- Yes, LLMs degrade in long contexts, and yes, they sometimes choke.
- BUT the severity depends on:
- Model choice
- Type of long-context task
- Input structure and prompt engineering
- Whether degradation is gradual or sudden/catastrophic.
So, a more accurate restatement of the claim would be:
“LLMs still struggle with long contexts, with performance degradation that varies by model, task, and input structure. Some models handle long contexts better than others, but no model completely avoids issues like recency bias, compression errors, or hallucination.”
This acknowledges the core problem while avoiding the overly strong generalization.
Me: Here’s more context, pun intended. Look at the image I uploaded and provide your thoughts and feedback.
ChatGPT:
I can analyze the image directly. Here’s my breakdown of what it shows and how it relates to the claim that “LLMs are still incredibly bad at long context.”
1. Core Takeaways from the Image
- The graph and table present results from the NoLiMa benchmark, which evaluates how well LLMs retain and utilize long-context information.
- The x-axis represents context length (in tokens), going from 1K to 32K.
- The y-axis represents the performance score (NoLiMa), showing how well the models perform on a long-context task.
- The trend: All models degrade in performance as context length increases, but some degrade more sharply than others.
- Observations from the Table:
- GPT-4o performs the best overall, retaining a high score up to 32K tokens.
- Claude 3.5 Sonnet and Gemini 1.5 Flash degrade significantly past 8K tokens.
- Llama models struggle a lot with long context, losing significant accuracy by 8K tokens.
- Some models effectively fail beyond 4K or 8K tokens, with scores dropping below 50% of their base scores.
2. What This Confirms About LLMs and Long Context
- “LLMs don’t perform well on long context” → True, but with nuance.
- Performance does degrade, but some models (GPT-4o, to an extent Claude 3.5) hold up better than others.
- Not all models “choke” in the same way — some gradually degrade, others collapse quickly.
- “Perfect needle-in-the-haystack scores are easy — attention mechanisms can match the word” → Makes sense.
- Simple keyword retrieval is easy for LLMs because attention mechanisms are optimized for it.
- However, reasoning over long spans of text (1-hop inference) is much harder. This aligns with the sharp drop in performance.
- “When you require 1-hop of reasoning, performance degrades quickly” → Supported by data.
- The sharp decline in scores past 4K to 8K tokens suggests that LLMs are struggling to reason over long spans of information.
- This implies that memory compression techniques or attention limits are a bottleneck.
- “This is why guaranteeing correctness for agents is hard” → Completely valid.
- If models struggle with retrieving and reasoning over long documents, then AI agents relying on them will be unreliable for tasks requiring long-term memory or detailed multi-step reasoning.
3. How This Fits Into the Original Claim
- “LLMs are still incredibly bad at long context” → True in some ways, but relative to task and model.
- If “bad” means losing accuracy beyond a few thousand tokens, then yes, most models do degrade.
- If “bad” means completely useless, then no — GPT-4o and Claude 3.5 Sonnet still retain meaningful performance up to a reasonable threshold.
- “Severe drop in response quality” → This is confirmed.
- The table highlights models dropping below 50% of their base performance past a certain context length.
- Some models like Llama 3.1 8B and Gemini 1.5 Flash degrade catastrophically beyond 8K tokens.
- “Doesn’t really matter what model — it will choke” → Not quite true.
- Some models choke faster (Llama, Gemini 1.5 Flash), while others (GPT-4o, Claude 3.5 Sonnet) degrade more gracefully.
- GPT-4o still performs above 80% at 32K tokens — so it hasn’t fully choked.
4. Alternative Framing Based on Data
Instead of the blanket statement that “LLMs are incredibly bad at long context,” a more accurate summary would be:
“LLMs degrade significantly with long context, with most models struggling beyond 8K tokens. Simple retrieval is relatively easy, but multi-hop reasoning over long spans is still a major weakness. Some models (GPT-4o, Claude 3.5 Sonnet) retain reasonable performance, but others collapse quickly. Long-context reliability remains a bottleneck for AI agents requiring deep memory retention.”
Final Thoughts
- The research supports long-context struggles, but not all models fail equally.
- GPT-4o is the standout performer — it retains high performance even at 32K tokens.
- Reasoning over long context is the hardest challenge — not simple keyword retrieval.
- AI agents needing high-context recall will still struggle, though future models may improve with better memory architectures.
This aligns with the broader point that LLMs haven’t fully solved long-context reasoning, but the best models are far from useless — they just require structured, optimized input to maintain performance.
Me: What do you think can be done to address this issue, given that LLMs haven’t fully solved long-context reasoning?
ChatGPT:
Improving long-context reasoning in LLMs is a critical challenge, but there are multiple promising directions. Here’s a deep dive into what can be done, breaking it down into architectural improvements, memory enhancements, training strategies, and hybrid approaches.
1. Architectural Improvements: Fixing Attention Limitations
One of the core issues with long-context LLMs is that Transformer-based models struggle with memory retention due to quadratic attention complexity. This means as context grows, models have to allocate more computational power, leading to degradation. Here are some possible solutions:
A. Better Attention Mechanisms
- Sparse Attention Mechanisms: Instead of full quadratic self-attention, sparse attention techniques (e.g., Reformer, Longformer, BigBird) can selectively focus on key parts of the context.
- Mixture of Local and Global Attention: Allow models to attend to both nearby tokens (local context) and important key tokens from far away (global context) instead of processing everything equally.
- Sliding Window Attention: Instead of maintaining full-context awareness, models could maintain a dynamic rolling window of the most relevant sections.
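To make the local-plus-global pattern above concrete, here is a minimal sketch of the kind of attention mask such methods build. It is a simplified, single-head illustration in NumPy; real implementations such as Longformer or BigBird are considerably more involved.

```python
import numpy as np

def sparse_attention_mask(seq_len, window, global_tokens):
    """mask[i, j] == True means token i is allowed to attend to token j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True              # local: a sliding window around each token
    mask[:, global_tokens] = True          # every token may attend to the global tokens
    mask[global_tokens, :] = True          # global tokens may attend to everything
    return mask

# Example: 16 tokens, window of 2, token 0 acting as a global "summary" token.
mask = sparse_attention_mask(16, 2, [0])
print(int(mask.sum()), "allowed pairs instead of", 16 * 16)  # far fewer than full attention
```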
B. Recursive or Hierarchical Processing
- Instead of processing all tokens at once, LLMs could break text into smaller chunks, process each, and then merge findings into a higher-level representation.
- This could work similarly to how humans summarize and recall key ideas rather than remembering every single detail verbatim.
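A rough sketch of that chunk-then-merge idea follows, assuming a hypothetical call_llm helper standing in for whatever completion API is actually used:

```python
# Hierarchical ("map-reduce") processing sketch: chunk the document, summarize each
# chunk, then summarize the summaries into a higher-level representation.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM client of choice")

def chunk(text: str, max_chars: int = 4000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def hierarchical_summary(document: str) -> str:
    # Map step: compress each chunk independently.
    partials = [call_llm(f"Summarize the key points of this passage:\n\n{c}")
                for c in chunk(document)]
    # Reduce step: merge the partial summaries into one coherent summary.
    return call_llm("Combine these partial summaries into one coherent summary:\n\n"
                    + "\n\n".join(partials))
```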
2. Memory and Retrieval-Augmented Models
Even with better attention mechanisms, raw long-context retention isn’t enough. We need a memory system that allows models to recall relevant information selectively, rather than struggling with thousands of tokens at once.
A. Memory-Enhanced Models
- Instead of just “forgetting” past tokens after the context window limit is exceeded, models could have a dynamic, learnable memory buffer.
- This buffer would store compressed, structured representations of past interactions instead of full verbatim recall.
- Anthropic’s Claude memory experiments and OpenAI’s GPT-4-turbo with memory are early steps in this direction.
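As a rough illustration of such a buffer (with a hypothetical compress function standing in for an actual summarization call; production memory features are more sophisticated than this):

```python
# Rolling memory sketch: keep the latest turns verbatim and compressed notes of
# everything older, instead of silently dropping old context.
from collections import deque

def compress(turn: str) -> str:
    return turn[:120]                       # placeholder: truncate instead of summarizing

class ConversationMemory:
    def __init__(self, max_recent: int = 6):
        self.recent = deque(maxlen=max_recent)   # full text of the latest turns
        self.archive: list[str] = []             # compressed notes of older turns

    def add(self, turn: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            self.archive.append(compress(self.recent[0]))  # compress the turn about to fall out
        self.recent.append(turn)

    def context(self) -> str:
        return ("ARCHIVED NOTES:\n" + "\n".join(self.archive)
                + "\n\nRECENT TURNS:\n" + "\n".join(self.recent))
```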
B. Retrieval-Augmented Generation (RAG)
- Instead of forcing a model to process all 200K tokens directly, why not retrieve relevant portions dynamically?
- Vector databases (e.g., FAISS, Pinecone) could store embeddings of large texts, and retrieval algorithms could extract only the relevant chunks to process.
- This is already in use for LLMs powering chatbots, search, and enterprise applications.
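A minimal sketch of the retrieval step described above, assuming a hypothetical embed function in place of a real embedding model and a plain NumPy search in place of a vector database:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("wire this up to your embedding model")

def retrieve(query: str, chunks: list[str], k: int = 4) -> list[str]:
    doc_vecs = embed(chunks)                       # shape: (n_chunks, dim)
    q_vec = embed([query])[0]                      # shape: (dim,)
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    top = np.argsort(-sims)[:k]                    # indices of the k most similar chunks
    return [chunks[i] for i in top]

# The prompt then contains only the retrieved chunks plus the question,
# instead of the full 200K-token document.
```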
C. Hybrid Symbolic + Neural Approaches
- One interesting idea is integrating symbolic reasoning (explicit logic structures) with LLMs.
- This would allow facts, rules, or logical structures to persist independently of raw token memory.
- For example, an LLM could store an internal knowledge graph and reference it dynamically instead of relying on pure token-based recall.
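For illustration, a tiny fact store along these lines might look like the following. The names and structure are hypothetical, not any particular knowledge-graph product’s API.

```python
from collections import defaultdict

class FactStore:
    """Stores (subject, relation, object) triples the model can query explicitly."""
    def __init__(self):
        self.triples = defaultdict(set)            # subject -> {(relation, object), ...}

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.triples[subject].add((relation, obj))

    def query(self, subject: str, relation: str) -> list[str]:
        return [o for r, o in self.triples[subject] if r == relation]

facts = FactStore()
facts.add("contract_A", "expires_on", "2026-03-01")
facts.add("contract_A", "signed_by", "Acme Corp")
print(facts.query("contract_A", "expires_on"))     # ['2026-03-01']
```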
3. Better Training Strategies
Even with architectural and memory improvements, training methodologies need to evolve to support long-context reasoning.
A. Curriculum Learning for Long Contexts
- Instead of just throwing random large texts at LLMs during training, why not gradually increase the context difficulty?
- This would involve:
- Short-context tasks first (1K-2K tokens)
- Mid-range contexts (8K-16K tokens)
- Ultra-long context understanding (100K tokens+)
- By structuring training in a step-wise fashion, models would learn to handle longer contexts more naturally.
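Sketched very roughly, a length-based curriculum could be scheduled like this. filter_by_length and train_epoch are hypothetical hooks into whatever training loop is actually used.

```python
CURRICULUM = [
    (2_000, 3),       # stage 1: up to ~2K tokens, 3 epochs
    (16_000, 2),      # stage 2: up to ~16K tokens, 2 epochs
    (100_000, 1),     # stage 3: ultra-long contexts, 1 epoch
]

def filter_by_length(dataset, max_tokens):
    raise NotImplementedError("select examples whose context fits within max_tokens")

def train_epoch(model, examples):
    raise NotImplementedError("one pass of your usual fine-tuning loop")

def run_curriculum(model, dataset):
    # Train on short contexts first, then progressively longer ones.
    for max_tokens, epochs in CURRICULUM:
        examples = filter_by_length(dataset, max_tokens)
        for _ in range(epochs):
            train_epoch(model, examples)
```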
B. Multi-Hop Reasoning Training
- Current models struggle with reasoning across long texts. We need explicit training on:
- Retrieving information across long documents
- Inferring relationships between distant facts
- Synthesizing complex narratives over thousands of tokens
- Benchmark datasets (e.g., NoLiMa, LongBench) could be used to fine-tune models specifically for long-context comprehension.
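One way to build such training or evaluation items is to plant two related facts far apart in filler text, so the question can only be answered by connecting them. A toy sketch, with entirely made-up facts:

```python
import random

def build_multihop_example(filler_sentences, n_filler=500):
    fact_a = "Dr. Reyes leads the Orion project."             # planted near the start
    fact_b = "The Orion project is based in Lisbon."           # planted near the end
    question = "In which city is Dr. Reyes's project based?"   # requires linking both facts
    body = [random.choice(filler_sentences) for _ in range(n_filler)]
    body.insert(10, fact_a)
    body.insert(len(body) - 10, fact_b)
    return {"context": " ".join(body), "question": question, "answer": "Lisbon"}
```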
C. Reinforcement Learning for Long-Term Coherence
- Models like GPT-4 and Claude 3 already use Reinforcement Learning from Human Feedback (RLHF) to improve responses.
- A new approach could involve RLHF for long-context tracking — rewarding models for maintaining coherent narratives over long interactions.
- This could involve:
- Rewarding models that correctly recall information from earlier in a conversation.
- Penalizing hallucinations or inconsistencies when recalling long-term details.
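A toy version of such a reward signal might look like this. Here stated_facts would come from annotation or automatic extraction; this is only an illustration, not an actual RLHF pipeline.

```python
def long_context_reward(response: str, stated_facts: dict[str, str]) -> float:
    """Reward correct recall of earlier facts, penalize contradicted recall."""
    reward = 0.0
    for topic, value in stated_facts.items():
        if topic.lower() in response.lower():                # the response touches this topic
            correct = value.lower() in response.lower()      # does it match what was said earlier?
            reward += 1.0 if correct else -1.0
    return reward

print(long_context_reward("The deadline we agreed on is March 3.",
                          {"deadline": "March 3"}))          # 1.0
```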
4. Hybrid Systems: Combining LLMs with External Tools
Instead of relying purely on LLMs for long-context recall, a hybrid approach could involve delegating memory and retrieval tasks to specialized systems.
A. AI Orchestration with Contextual Assistants
- Rather than forcing a single LLM to handle all long-context processing, break it into specialized components:
- Long-Term Memory Module (stores and retrieves relevant historical context)
- Reasoning Module (performs logic and inference over long-range data)
- Summarization Module (condenses and refines extended conversations)
- A central “agent” could coordinate these submodules, calling them as needed.
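A bare-bones sketch of that orchestration pattern, with placeholder module classes rather than any existing framework:

```python
class MemoryModule:
    """Stores past turns and retrieves the ones that share words with the query."""
    def __init__(self):
        self.notes = []

    def store(self, text):
        self.notes.append(text)

    def recall(self, query):
        words = [w.strip("?.,!").lower() for w in query.split()]
        return [n for n in self.notes if any(w in n.lower() for w in words)]

class SummarizerModule:
    """Placeholder for an LLM call that condenses retrieved history."""
    def condense(self, texts):
        return " / ".join(t[:80] for t in texts)

class Orchestrator:
    def __init__(self):
        self.memory = MemoryModule()
        self.summarizer = SummarizerModule()

    def handle(self, user_message):
        relevant = self.memory.recall(user_message)      # pull only what matters from history
        context = self.summarizer.condense(relevant)     # condense it before reasoning
        self.memory.store(user_message)
        return f"[reasoning model would run over: {context}]"

agent = Orchestrator()
agent.memory.store("The quarterly report is due on Friday.")
print(agent.handle("When is the report due?"))
```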
B. External Knowledge Graph Integration
- Instead of treating long texts as raw tokens, AI could interact with structured knowledge graphs (like Wikidata, ArangoDB).
- This would allow fact recall to be done via explicit relationships, not just token attention mechanisms.
C. Multi-Agent LLM Systems
- Imagine instead of one LLM struggling to retain context, you had multiple models handling different tasks:
- A “recall model” could handle memory storage.
- A “reasoning model” could focus on logical inference.
- An “editing model” could refine and condense long texts dynamically.
This multi-agent approach would mimic how humans collaborate — where different cognitive systems work together to handle complexity.
5. Evaluating Success: New Benchmarks for Long Context
- Current benchmarks (NoLiMa, LongBench, HELM) focus on basic long-context retention.
- We need better metrics for:
- Multi-hop reasoning across long documents
- Multi-turn dialogue coherence over extended interactions
- Context-aware retrieval performance
- Hallucination detection and long-term factual recall
- These metrics should directly influence how models are trained and optimized.
Final Thoughts: The Future of Long-Context LLMs
What’s Next?
Short-Term (6–12 months) → Memory-enhanced LLMs and improved retrieval
- More Claude-style memory systems.
- Better hybrid RAG models that dynamically retrieve long documents.
- Minor upgrades to GPT-4.5 / Claude 3.5 / Gemini Ultra to improve recall.
Mid-Term (1–2 years) → Architectural breakthroughs & hierarchical memory
- Improved sparse attention models (Mixture of Local and Global Attention).
- Hierarchical LLMs that process text in structured chunks.
Long-Term (3–5 years) → True AI assistants with persistent memory
- LLMs with long-term persistent memory that updates over time.
- Multi-agent models working together for superior recall and reasoning.
- Hybrid neuro-symbolic models combining neural networks with structured knowledge.
Conclusion
Long-context reasoning is a hard problem, but solvable with:
✅ Better architectures (sparse attention, hierarchical processing)
✅ Memory-enhanced models (retrieval-based LLMs, persistent memory)
✅ Improved training techniques (curriculum learning, reinforcement for long-context coherence)
✅ Hybrid AI systems (multi-agent coordination, structured knowledge graphs)
If these improvements are implemented, LLMs won’t just “choke” on long contexts anymore — they’ll become true long-term reasoning assistants.
Me: Think slow and answer the question again. Also create a table of the doability of what is best to solve the issue from best to worst, rate them 1–10, and reasons why
ChatGPT:
⌛ Slow Thinking Activated
<contemplator> Alright, let’s really slow down and think through the problem of LLMs struggling with long-context reasoning. Instead of rushing toward obvious solutions, let’s question the core assumptions, map out different solutions, and evaluate their feasibility.
Step 1: Understanding the Core Issue
The main problem is that LLMs degrade in performance as context length increases. But why?
The Transformer Bottleneck
- Transformers use self-attention, which requires comparing every token to every other token in a given sequence.
- This means more tokens → quadratically higher computational cost.
- Eventually, the model struggles to process everything meaningfully.
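As rough arithmetic (assuming plain full self-attention and ignoring implementation optimizations), the number of pairwise comparisons grows with the square of the context length:

```python
# Why quadratic cost hurts: 32K tokens is 32x more tokens than 1K,
# but roughly 1,024x more pairwise attention work per layer.
for n in (1_000, 8_000, 32_000, 128_000):
    print(f"{n:>7,} tokens -> {n * n:>17,} token-pair comparisons per layer")
```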
Memory Retention Issues
- Even though modern models like GPT-4o claim 128K+ token support, they don’t truly retain all of it in a meaningful way.
- The model tends to prioritize recent tokens over older ones (recency bias).
- Older information gets compressed or ignored, making multi-step reasoning difficult.
Retrieval vs. Understanding
- Simple look-up retrieval (needle-in-haystack tasks) works well.
- Multi-hop reasoning over long spans (connecting ideas from different parts of a document) fails rapidly.
- This means models might “remember” details but fail to reason with them properly.
Step 2: Possible Solutions
We now have four main categories of possible solutions:
- Architectural Improvements (changing how models process information)
- Memory and Retrieval-Augmented Models (storing and retrieving context more effectively)
- Better Training Strategies (teaching models to handle long-context better)
- Hybrid AI Systems (combining LLMs with structured tools)
Step 3: Evaluating Solutions — Best to Worst
To determine what’s best, let’s evaluate each approach based on:
✅ Effectiveness — How well it addresses the issue
✅ Doability — How feasible it is to implement in current models
✅ Scalability — Will it work as contexts get even longer?
✅ Computational Cost — How expensive it is to run
I’ll rank them from best (most doable + effective) to worst and assign a score (1–10) for doability.
Step 5: Best Path Forward
Short-Term (6–12 months)
- Improve retrieval-based models (RAG)
- Enhance memory-augmented training
- Use curriculum learning to extend context window naturally
Mid-Term (1–2 years)
- Sparse Attention becomes standard
- Hierarchical Processing improves multi-step reasoning
- More structured retrieval techniques emerge
Long-Term (3–5 years)
- Fundamental architecture shifts (Neuro-symbolic AI, Multi-agent LLMs)
- Fully dynamic memory-aware AI assistants
Final Thoughts
✅ Memory-Augmented Models & Retrieval-Based Methods (RAG) are the best short-term solutions.
✅ Architectural tweaks (Sparse Attention, Chunking) will improve efficiency.
✅ Radical changes (Neuro-symbolic AI) are exciting but far off.
If we implement these improvements incrementally, long-context failures won’t be a permanent issue — they’ll just be another growing pain in the evolution of AI.
</contemplator>