Harsh Vardhan Goswami
Jul 25, 2025
Large Language Models (LLMs) can now process input sequences that stretch to millions of tokens. This breakthrough unlocks opportunities, from analyzing entire novels to navigating vast chat logs or codebases at once. But there's a catch: recent research and benchmarks show that simply feeding LLMs more data doesn't guarantee steady performance. As input size grows, models often become less accurate and reliable in unpredictable, nonlinear ways, a phenomenon known as "context rot." As we push the boundaries of context length, understanding these limitations is critical for building dependable AI solutions.
This report digs into the subtleties of "context rot," spotlighting results from a major study of 18 LLMs, including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3. By designing tasks that hold everything constant except input length, the study isolates how increasing the number of input tokens alone affects performance. The findings offer practical insights for developers, AI enthusiasts, and product teams who work with LLMs in real-world settings, where prompt size and complexity can vary widely.

The Core Challenge: Nonuniform Use of Context
Contrary to common assumptions, LLMs do not treat all parts of their context window equally. Rather than attending to and processing every token uniformly, models lose quality and response fidelity unpredictably as inputs lengthen, even on simple tasks. This degradation manifests as:
Increased hallucinations or incorrect answers
Lower precision in retrieval tasks
More frequent task refusal or unstable outputs
The root cause lies in how LLMs handle long sequences internally: they struggle to distinguish relevant from irrelevant information and run into architectural and optimization limitations as context scales.
Experimental Setup
Models Evaluated
18 LLMs spanning commercial closed-source and open-weight open-source families
Notable models: GPT-4.1 (and variants), Anthropic Claude series, Gemini 2.5 Pro, Qwen3 family
Both standard and "thinking" modes evaluated for models that support reasoning control
Task Design
To isolate input length as the key variable, the experiments held task complexity constant while varying only the input size across eight length increments and multiple needle (target information) positions. Tasks included the following (a minimal construction sketch follows the list):
Needle in a Haystack (NIAH): Retrieve a known “needle” sentence embedded inside lengthy distractor and irrelevant text dubbed the “haystack.”
Extension to Semantic Matching: Needles/questions that require latent semantic associations, not just lexical matches.
Conversational QA Benchmark (LongMemEval): Queries on very long chat histories with relevant and irrelevant content intermixed.
Repeated Words Replication: A simple synthetic task in which the model replicates a long input containing one uniquely placed token, testing output reliability as output length grows.
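To make the NIAH setup concrete, here is a minimal sketch of how such a prompt can be assembled. Everything here (the function name, the filler corpus, the count_tokens callable) is an illustrative placeholder, not the study's actual harness:

```python
import random

def build_niah_prompt(needle: str, question: str, filler_sentences: list[str],
                      target_tokens: int, needle_position: float,
                      count_tokens) -> str:
    """Assemble a needle-in-a-haystack prompt of roughly target_tokens tokens.

    needle_position is a fraction in [0, 1] giving the needle's approximate
    depth in the haystack; count_tokens is any tokenizer-based length function.
    """
    haystack: list[str] = []
    while count_tokens(" ".join(haystack)) < target_tokens:
        haystack.append(random.choice(filler_sentences))
    insert_at = int(len(haystack) * needle_position)
    haystack.insert(insert_at, needle)
    return " ".join(haystack) + "\n\n" + question
```

Sweeping target_tokens over several length increments and needle_position over several depths reproduces the shape of this kind of NIAH grid.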
Evaluation Metrics
Accuracy judged by GPT-4.1-based judges calibrated against human evaluators (>99% alignment).
Analysis of hallucination rates, refusals to answer, and the positional accuracy of unique tokens.
Impact of distractors, needle-question similarity, and haystack structure examined statistically.
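As a rough illustration of LLM-as-judge grading, the sketch below scores an answer against a reference. The call_model argument is a placeholder for whatever client you use to query the judge, and the prompt wording is an assumption, not the study's actual template:

```python
JUDGE_TEMPLATE = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(question: str, reference: str, answer: str, call_model) -> bool:
    """Return True if the judge model marks the answer correct.

    call_model is any callable that sends a prompt to the judge LLM and
    returns its text response; provider SDK details are out of scope here.
    """
    verdict = call_model(JUDGE_TEMPLATE.format(
        question=question, reference=reference, answer=answer))
    return verdict.strip().upper().startswith("CORRECT")
```

Calibration then amounts to comparing judge verdicts against a sample of human labels.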
Key Findings and Insights
1. Performance Declines With Increasing Input Length
Across all experiments, model accuracy consistently declines with increasing input length, regardless of task simplicity.
Shorter inputs show robust performance; degradation accelerates beyond several thousand tokens.
Even autoregressive output generation tasks, where output length scales with input, exhibit substantial non-uniformity in correctness as length grows.

2. Needle-Question Semantic Similarity Influences Degradation Rate
Tasks with higher semantic similarity (closely related meaning between query and answer) degrade more slowly.
Lower similarity pairs, more reflective of real-world ambiguous queries, see sharper accuracy drops as context length increases.
This demonstrates the compound challenge posed by ambiguous and semantically complex information retrieval in long contexts.
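If you want to reproduce this analysis on your own data, a simple way to quantify needle-question similarity is cosine similarity over text embeddings. This is a generic sketch: the embed callable is a placeholder for any embedding model of your choosing.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def needle_question_similarity(needle: str, question: str, embed) -> float:
    """Score how closely a question matches its needle; embed() is any
    text-embedding callable (model choice is left to the reader)."""
    return cosine_similarity(embed(needle), embed(question))
```

Binning needle-question pairs by this score before running a length sweep makes the similarity-versus-degradation trend visible in your own evaluations.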
3. Distractors Amplify Errors and Hallucinations
Introducing distractors—plausible but irrelevant or incorrect text—reduces accuracy in a non-uniform manner.
The impact intensifies with more distractors and longer contexts.
Some distractors are more pernicious than others, and how readily they trigger hallucinations varies by model family.
Claude models tend to be more conservative, abstaining when uncertain, whereas GPT models commonly hallucinate confidently.
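A minimal sketch of the distractor condition, assuming you already have a haystack as a list of sentences and a pool of distractor sentences (topically similar to the needle but carrying the wrong detail); the helper name and seeding are illustrative:

```python
import random

def inject_distractors(haystack_sentences: list[str], distractors: list[str],
                       count: int, seed: int = 0) -> list[str]:
    """Scatter `count` distractor sentences at random positions in a haystack,
    forcing the model to discriminate rather than merely locate the needle."""
    rng = random.Random(seed)
    result = list(haystack_sentences)
    for d in rng.sample(distractors, k=min(count, len(distractors))):
        result.insert(rng.randrange(len(result) + 1), d)
    return result
```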
4. Needle-Haystack Content Similarity Shows Non-Uniform Effects
Whether the needle blends semantically with the haystack influences difficulty but not predictably.
In some haystacks, needles dissimilar to the surrounding content were easier to identify.
This underlines the role of content diversity and contextual distinctiveness in long-context processing.
5. Haystack Structural Coherence Reduces Performance
Contrary to intuition, models do better when the haystack’s sentences are shuffled rather than logically ordered.
Preserving narrative or argumentative flow introduces additional processing challenges for attention mechanisms.
This suggests that models struggle with extended dependencies and global coherence in long inputs.
6. LongMemEval Chat History Tests Highlight Retrieval and Reasoning Costs
Models perform much better on “focused” prompts containing only relevant context than on full prompts with irrelevant chatter.
Adding irrelevant history forces models to simultaneously search for relevant info and reason over it, degrading performance.
Behavior patterns vary: conservative abstentions in Claude versus hallucination-prone confidence in GPT models under ambiguity.
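To see how much irrelevant history costs on your own workload, you can grade the same queries against a full prompt and a focused prompt. This is a generic sketch: is_relevant is a placeholder for whatever relevance filter you use (keyword match, embedding retrieval, and so on), and the message format is assumed.

```python
def build_prompt_pair(history: list[dict], query: str, is_relevant) -> tuple[str, str]:
    """Build a 'full' prompt (entire chat history) and a 'focused' prompt
    (only the messages judged relevant to the query)."""
    def render(messages: list[dict]) -> str:
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

    suffix = f"\n\nQuestion: {query}"
    full = render(history) + suffix
    focused = render([m for m in history if is_relevant(m, query)]) + suffix
    return full, focused
```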
7. Repeated Words Experiment Underlines Long-Output Challenges
Models degrade in replicating long sequences exactly, with accuracy dropping and positional errors growing at higher token counts.
Over-generation, under-generation, and refusals become common, indicating difficulty maintaining precise long outputs.
Placing the unique token near the beginning boosts accuracy; later positions often cause failure.
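The repeated-words task is easy to replicate on your own models; the sketch below builds the input and scores the model's copy. The function names and scoring fields are my own shorthand, not the study's code:

```python
def make_repeated_words_input(common: str, unique: str,
                              total: int, unique_index: int) -> str:
    """Build `total` copies of a common word with one unique word at
    `unique_index`, e.g. 'apple apple ... zebra ... apple'."""
    words = [common] * total
    words[unique_index] = unique
    return " ".join(words)

def score_replication(expected: str, output: str, unique: str) -> dict:
    """Measure over/under-generation and where the unique word ended up."""
    exp, out = expected.split(), output.split()
    return {
        "length_error": len(out) - len(exp),
        "exact_match": out == exp,
        "unique_position": out.index(unique) if unique in out else None,
    }
```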
Practical Implications for Developers and AI Product Teams
Context Engineering Is Critical
How relevant info is selected, organized, and inserted into prompts strongly shapes LLM reliability.
Simply maximizing input size risks performance drops and erratic outputs.
Manage Distractors and Irrelevant Contexts
Filter out or flag non-essential content to reduce hallucination risk.
Consider prompt chunking and incremental retrieval strategies.
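One way to act on this is to rank chunks by relevance and keep only what fits a token budget, rather than sending everything at once. This is a minimal sketch: relevance and count_tokens are placeholders for your own scorer (for example, embedding similarity) and tokenizer.

```python
def select_relevant_chunks(chunks: list[str], query: str, relevance,
                           budget_tokens: int, count_tokens) -> list[str]:
    """Rank chunks by a relevance score and keep the best ones until the
    token budget is exhausted.

    relevance(chunk, query) -> float and count_tokens(text) -> int are
    supplied by the caller.
    """
    ranked = sorted(chunks, key=lambda c: relevance(c, query), reverse=True)
    kept, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```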
Don’t Trust Benchmark Scores Blindly
Widely used benchmarks (e.g., NIAH) can overestimate real-world long-context ability because they favor simple lexical match retrieval.
Use domain-specific, semantically rich, and distractor-inclusive tests.
Consider Model Behavior Differences for Application Design
Claude’s conservative abstentions might suit high-stakes or compliance-sensitive use cases.
GPT’s creative hallucinations might be controlled via prompt design or reasoning modes.
Experiment with Input Ordering
Breaking logical sequence by shuffling or chunking inputs may counterintuitively improve some retrieval tasks.
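This is cheap to test: generate an ordered and a shuffled variant of the same context and compare accuracy on your task. A minimal sketch (the seeding and join format are arbitrary choices):

```python
import random

def ordered_and_shuffled(chunks: list[str], seed: int = 0) -> tuple[str, str]:
    """Return the same chunks joined in their original order and in a
    shuffled order, for an A/B comparison on a retrieval task."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    return "\n\n".join(chunks), "\n\n".join(shuffled)
```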
Monitoring and Testing
Track performance across input length variations in your use cases to identify failure thresholds.
Fine-tune or adjust prompt strategies based on observed degradations.
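A simple harness for this is a length sweep: run the same task at several input sizes and record accuracy to find where quality falls off. Everything below is a placeholder to adapt to your stack: make_cases builds (prompt, expected) pairs at a given length, run_model calls your LLM, and grade is your correctness check.

```python
def length_sweep(make_cases, run_model, grade,
                 lengths: list[int]) -> dict[int, float]:
    """Accuracy at each input length; a sharp drop marks your failure threshold."""
    results: dict[int, float] = {}
    for n in lengths:
        cases = make_cases(n)  # list of (prompt, expected_answer) pairs
        correct = sum(grade(run_model(prompt), expected) for prompt, expected in cases)
        results[n] = correct / len(cases)
    return results

# Example: length_sweep(make_cases, call_model, judge, [1_000, 4_000, 16_000, 64_000])
```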
Summary Table of Findings
| Aspect | Observation | Developer Action |
| --- | --- | --- |
| Input Length | Performance degrades non-linearly | Limit or chunk input size |
| Needle-Question Similarity | Lower similarity = faster degradation | Prefer high semantic matching |
| Distractors | Increase hallucinations | Minimize & separate distractors |
| Haystack Structure | Shuffling improves accuracy | Experiment with input ordering |
| Model-specific Behavior | Conservative vs hallucination patterns | Choose model per risk tolerance |
| Output Length (Repeated Words) | Accuracy drops at scale | Monitor and segment output length |
Conclusion
The promise of effectively using extremely long input contexts in LLMs faces significant challenges due to context rot: the uneven and accelerating decline in model performance as input length increases. These findings underscore that the quantity of input tokens is not the sole determinant of quality—how the context is constructed, filtered, and presented is equally, if not more, vital.
For teams building AI-powered products that reason over long documents, chat histories, or complex data sets, understanding and engineering around these limitations is essential. Developing effective prompt strategies, employing semantic similarity techniques, and carefully handling distractors can mitigate context rot and improve real-world reliability.
This work provides a comprehensive foundation for approaching long-context LLM deployment pragmatically, with an emphasis on careful context engineering and evaluation beyond simplistic benchmarks.
Read the detailed report: Context Rot: How Increasing Input Tokens Impacts LLM Performance