Harsh Vardhan Goswami
Jul 25, 2025
Large Language Models (LLMs) can now process input sequences that stretch to millions of tokens. This breakthrough unlocks opportunities, from analyzing entire novels to navigating vast chat logs or codebases at once. But there's a catch: recent research and benchmarks show that simply feeding LLMs more data doesn't guarantee steady performance. As input size grows, models often become less accurate and reliable in unpredictable, nonlinear ways, a phenomenon known as "context rot." As we push the boundaries of context length, understanding these limitations is critical for building dependable AI solutions.
This report digs into the subtleties of "context rot," spotlighting results from a major study of 18 LLMs, including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3. By designing tasks that hold everything constant except input length, the study isolates how increasing the number of input tokens alone affects performance. The findings offer practical insights for developers, AI enthusiasts, and product teams who work with LLMs in real-world settings, where prompt size and complexity can vary widely.

The Core Challenge: Nonuniform Use of Context
Contrary to common assumptions, LLMs do not treat all parts of their context window equally. Rather than attending to and processing every token uniformly, models lose quality and response fidelity unpredictably as inputs lengthen, even on simple tasks. This degradation manifests as:
Increased hallucinations or incorrect answers
Lower precision in retrieval tasks
More frequent task refusal or unstable outputs
The root cause lies in how LLMs handle long sequences internally: they struggle to distinguish relevant from irrelevant information and run into architectural and optimization limitations as context scales.
Experimental Setup
Models Evaluated
18 LLMs spanning commercial closed-source and open-weight open-source families
Notable models: GPT-4.1 (and variants), Anthropic Claude series, Gemini 2.5 Pro, Qwen3 family
Both standard and "thinking" modes evaluated for models that support reasoning control
Task Design
To isolate input length as the key variable, the experiments held task complexity constant while varying only the input size across eight length increments and multiple needle (target information) positions. Tasks included the following (a minimal construction sketch follows the list):
Needle in a Haystack (NIAH): Retrieve a known “needle” sentence embedded inside lengthy distractor and irrelevant text dubbed the “haystack.”
Extension to Semantic Matching: Needles/questions that require latent semantic associations, not just lexical matches.
Conversational QA Benchmark (LongMemEval): Queries on very long chat histories with relevant and irrelevant content intermixed.
Repeated Words Replication: A simple synthetic task in which the model replicates a long input containing one uniquely placed token, testing output reliability as output length grows.
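To make the NIAH setup concrete, here is a minimal sketch of how such a prompt can be assembled. Everything here (the function name, the filler corpus, the count_tokens callable) is an illustrative placeholder, not the study's actual harness:

```python
import random

def build_niah_prompt(needle: str, question: str, filler_sentences: list[str],
                      target_tokens: int, needle_position: float,
                      count_tokens) -> str:
    """Assemble a needle-in-a-haystack prompt of roughly target_tokens tokens.

    needle_position is a fraction in [0, 1] giving the needle's approximate
    depth in the haystack; count_tokens is any tokenizer-based length function.
    """
    haystack: list[str] = []
    while count_tokens(" ".join(haystack)) < target_tokens:
        haystack.append(random.choice(filler_sentences))
    insert_at = int(len(haystack) * needle_position)
    haystack.insert(insert_at, needle)
    return " ".join(haystack) + "\n\n" + question
```

Sweeping target_tokens over several length increments and needle_position over several depths reproduces the shape of this kind of NIAH grid.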
Evaluation Metrics
Accuracy judged by GPT-4.1-based judges calibrated against human evaluators (>99% alignment).
Analysis of hallucination rates, refusals to answer, and the positional accuracy of unique tokens.
Impact of distractors, needle-question similarity, and haystack structure examined statistically.
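As a rough illustration of LLM-as-judge grading, the sketch below scores an answer against a reference. The call_model argument is a placeholder for whatever client you use to query the judge, and the prompt wording is an assumption, not the study's actual template:

```python
JUDGE_TEMPLATE = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(question: str, reference: str, answer: str, call_model) -> bool:
    """Return True if the judge model marks the answer correct.

    call_model is any callable that sends a prompt to the judge LLM and
    returns its text response; provider SDK details are out of scope here.
    """
    verdict = call_model(JUDGE_TEMPLATE.format(
        question=question, reference=reference, answer=answer))
    return verdict.strip().upper().startswith("CORRECT")
```

Calibration then amounts to comparing judge verdicts against a sample of human labels.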
Key Findings and Insights
1. Performance Declines With Increasing Input Length
Across all experiments, model accuracy consistently declines with increasing input length, regardless of task simplicity.
Shorter inputs show robust performance; degradation accelerates beyond several thousand tokens.
Even autoregressive output generation tasks, where output length scales with input, exhibit substantial non-uniformity in correctness as length grows.

2. Needle-Question Semantic Similarity Influences Degradation Rate
Tasks with higher semantic similarity (closely related meaning between query and answer) degrade more slowly.
Lower similarity pairs, more reflective of real-world ambiguous queries, see sharper accuracy drops as context length increases.
This demonstrates the compound challenge posed by ambiguous and semantically complex information retrieval in long contexts.
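If you want to reproduce this analysis on your own data, a simple way to quantify needle-question similarity is cosine similarity over text embeddings. This is a generic sketch: the embed callable is a placeholder for any embedding model of your choosing.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def needle_question_similarity(needle: str, question: str, embed) -> float:
    """Score how closely a question matches its needle; embed() is any
    text-embedding callable (model choice is left to the reader)."""
    return cosine_similarity(embed(needle), embed(question))
```

Binning needle-question pairs by this score before running a length sweep makes the similarity-versus-degradation trend visible in your own evaluations.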
3. Distractors Amplify Errors and Hallucinations
Introducing distractors—plausible but irrelevant or incorrect text—reduces accuracy in a non-uniform manner.
The impact intensifies with more distractors and longer contexts.
Some distractors are more pernicious than others, and how readily they trigger hallucinations varies by model family.
Claude models tend to be more conservative, abstaining when uncertain, whereas GPT models commonly hallucinate confidently.
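A minimal sketch of the distractor condition, assuming you already have a haystack as a list of sentences and a pool of distractor sentences (topically similar to the needle but carrying the wrong detail); the helper name and seeding are illustrative:

```python
import random

def inject_distractors(haystack_sentences: list[str], distractors: list[str],
                       count: int, seed: int = 0) -> list[str]:
    """Scatter `count` distractor sentences at random positions in a haystack,
    forcing the model to discriminate rather than merely locate the needle."""
    rng = random.Random(seed)
    result = list(haystack_sentences)
    for d in rng.sample(distractors, k=min(count, len(distractors))):
        result.insert(rng.randrange(len(result) + 1), d)
    return result
```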
4. Needle-Haystack Content Similarity Shows Non-Uniform Effects
Whether the needle blends semantically with the haystack influences difficulty but not predictably.
In some haystacks, needles dissimilar to the surrounding content were easier to identify.
This underlines the role of content diversity and contextual distinctiveness in long-context processing.
5. Haystack Structural Coherence Reduces Performance
Contrary to intuition, models do better when the haystack’s sentences are shuffled rather than logically ordered.
Preserving narrative or argumentative flow introduces additional processing challenges for attention mechanisms.
This suggests that models struggle with extended dependencies and global coherence in long inputs.
6. LongMemEval Chat History Tests Highlight Retrieval and Reasoning Costs
Models perform much better on “focused” prompts containing only relevant context than on full prompts with irrelevant chatter.
Adding irrelevant history forces models to simultaneously search for relevant info and reason over it, degrading performance.
Behavior patterns vary: conservative abstentions in Claude versus hallucination-prone confidence in GPT models under ambiguity.
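To see how much irrelevant history costs on your own workload, you can grade the same queries against a full prompt and a focused prompt. This is a generic sketch: is_relevant is a placeholder for whatever relevance filter you use (keyword match, embedding retrieval, and so on), and the message format is assumed.

```python
def build_prompt_pair(history: list[dict], query: str, is_relevant) -> tuple[str, str]:
    """Build a 'full' prompt (entire chat history) and a 'focused' prompt
    (only the messages judged relevant to the query)."""
    def render(messages: list[dict]) -> str:
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

    suffix = f"\n\nQuestion: {query}"
    full = render(history) + suffix
    focused = render([m for m in history if is_relevant(m, query)]) + suffix
    return full, focused
```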
7. Repeated Words Experiment Underlines Long-Output Challenges
Models degrade in replicating long sequences exactly, with accuracy dropping and positional errors growing at higher token counts.
Over-generation, under-generation, and refusals become common, indicating difficulty maintaining precise long outputs.
Placing the unique token near the beginning boosts accuracy; later positions often cause failure.
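The repeated-words task is easy to replicate on your own models; the sketch below builds the input and scores the model's copy. The function names and scoring fields are my own shorthand, not the study's code:

```python
def make_repeated_words_input(common: str, unique: str,
                              total: int, unique_index: int) -> str:
    """Build `total` copies of a common word with one unique word at
    `unique_index`, e.g. 'apple apple ... zebra ... apple'."""
    words = [common] * total
    words[unique_index] = unique
    return " ".join(words)

def score_replication(expected: str, output: str, unique: str) -> dict:
    """Measure over/under-generation and where the unique word ended up."""
    exp, out = expected.split(), output.split()
    return {
        "length_error": len(out) - len(exp),
        "exact_match": out == exp,
        "unique_position": out.index(unique) if unique in out else None,
    }
```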
Practical Implications for Developers and AI Product Teams
Context Engineering Is Critical
How relevant info is selected, organized, and inserted into prompts strongly shapes LLM reliability.
Simply maximizing input size risks performance drops and erratic outputs.
Manage Distractors and Irrelevant Contexts
Filter out or flag non-essential content to reduce hallucination risk.
Consider prompt chunking and incremental retrieval strategies.
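One way to act on this is to rank chunks by relevance and keep only what fits a token budget, rather than sending everything at once. This is a minimal sketch: relevance and count_tokens are placeholders for your own scorer (for example, embedding similarity) and tokenizer.

```python
def select_relevant_chunks(chunks: list[str], query: str, relevance,
                           budget_tokens: int, count_tokens) -> list[str]:
    """Rank chunks by a relevance score and keep the best ones until the
    token budget is exhausted.

    relevance(chunk, query) -> float and count_tokens(text) -> int are
    supplied by the caller.
    """
    ranked = sorted(chunks, key=lambda c: relevance(c, query), reverse=True)
    kept, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```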
Don’t Trust Benchmark Scores Blindly
Widely used benchmarks (e.g., NIAH) can overestimate real-world long-context ability because they favor simple lexical match retrieval.
Use domain-specific, semantically rich, and distractor-inclusive tests.
Consider Model Behavior Differences for Application Design
Claude’s conservative abstentions might suit high-stakes or compliance-sensitive use cases.
GPT’s creative hallucinations might be controlled via prompt design or reasoning modes.
Experiment with Input Ordering
Breaking logical sequence by shuffling or chunking inputs may counterintuitively improve some retrieval tasks.
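This is cheap to test: generate an ordered and a shuffled variant of the same context and compare accuracy on your task. A minimal sketch (the seeding and join format are arbitrary choices):

```python
import random

def ordered_and_shuffled(chunks: list[str], seed: int = 0) -> tuple[str, str]:
    """Return the same chunks joined in their original order and in a
    shuffled order, for an A/B comparison on a retrieval task."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    return "\n\n".join(chunks), "\n\n".join(shuffled)
```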
Monitoring and Testing
Track performance across input length variations in your use cases to identify failure thresholds.
Fine-tune or adjust prompt strategies based on observed degradations.
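A simple harness for this is a length sweep: run the same task at several input sizes and record accuracy to find where quality falls off. Everything below is a placeholder to adapt to your stack: make_cases builds (prompt, expected) pairs at a given length, run_model calls your LLM, and grade is your correctness check.

```python
def length_sweep(make_cases, run_model, grade,
                 lengths: list[int]) -> dict[int, float]:
    """Accuracy at each input length; a sharp drop marks your failure threshold."""
    results: dict[int, float] = {}
    for n in lengths:
        cases = make_cases(n)  # list of (prompt, expected_answer) pairs
        correct = sum(grade(run_model(prompt), expected) for prompt, expected in cases)
        results[n] = correct / len(cases)
    return results

# Example: length_sweep(make_cases, call_model, judge, [1_000, 4_000, 16_000, 64_000])
```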
Summary Table of Findings
| Aspect | Observation | Developer Action |
| --- | --- | --- |
| Input Length | Performance degrades non-linearly | Limit or chunk input size |
| Needle-Question Similarity | Lower similarity = faster degradation | Prefer high semantic matching |
| Distractors | Increase hallucinations | Minimize & separate distractors |
| Haystack Structure | Shuffling improves accuracy | Experiment with input ordering |
| Model-specific Behavior | Conservative vs hallucination patterns | Choose model per risk tolerance |
| Output Length (Repeated Words) | Accuracy drops at scale | Monitor and segment output length |
Conclusion
The promise of effectively using extremely long input contexts in LLMs faces significant challenges due to context rot: the uneven and accelerating decline in model performance as input length increases. These findings underscore that the quantity of input tokens is not the sole determinant of quality—how the context is constructed, filtered, and presented is equally, if not more, vital.
For teams building AI-powered products that reason over long documents, chat histories, or complex data sets, understanding and engineering around these limitations is essential. Developing effective prompt strategies, employing semantic similarity techniques, and carefully handling distractors can mitigate context rot and improve real-world reliability.
This work provides a comprehensive foundation for approaching long-context LLM deployment pragmatically, with an emphasis on careful context engineering and evaluation beyond simplistic benchmarks.
Read the detailed report: Context Rot: How Increasing Input Tokens Impacts LLM Performance