Context rot: why your AI agent gets dumber the longer it runs

Long-running AI agents degrade over time as the context window fills with noise, repeated instructions, and stale data. Here's how to detect context rot and the three patterns that actually fix it.

Here's something you'll notice after running AI agents in production for a few weeks: a fresh conversation with your agent is sharp. Give that same agent 40 messages of history and it starts contradicting earlier decisions, forgetting constraints, and producing worse output than it did at the start of the session.

It's not random. It's structural. The context window is a fixed-size working memory, and you're filling it with noise.

I call this context rot — the gradual degradation of agent performance as accumulated context crowds out the signal with stale data, repeated boilerplate, and irrelevant turns. Here's what causes it, how to measure it, and three patterns that genuinely fix it.

---

What's actually happening

Language models have no persistent memory between calls. Every request is a fresh inference over the entire sequence of tokens you provide. The "memory" is entirely the context window.
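To make the statelessness concrete, here is a minimal sketch of what a chat loop sends on each call. The `build_request` helper and the sample turns are hypothetical, invented for illustration; the only point is that the payload for turn N carries every earlier turn with it.

```python
def build_request(system: str, history: list[dict], user_msg: str) -> dict:
    """Assemble the full payload for one call; nothing is remembered between calls."""
    return {
        "system": system,
        "messages": history + [{"role": "user", "content": user_msg}],
    }

history: list[dict] = []
payload_sizes = []
for turn in range(1, 4):
    request = build_request("Respond in JSON.", history, f"question {turn}")
    payload_sizes.append(len(request["messages"]))
    # Hypothetical reply; a real agent would call the model here.
    history = request["messages"] + [{"role": "assistant", "content": f"answer {turn}"}]

print(payload_sizes)  # [1, 3, 5]
```

The message count grows by two per turn, so whatever noise enters the history early is re-read on every later call.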

This creates a few failure modes as conversations grow:

**1. Positional bias in attention.** Transformer attention isn't uniformly distributed across the context. Empirically, models tend to weight recent tokens and the very beginning of the context more heavily than the middle, an effect often called the "lost in the middle" phenomenon. Important instructions from turn 3 may be functionally invisible by turn 35.

**2. Instruction dilution.** Your system prompt says "always respond in JSON." By turn 20, there are 19 examples of the model responding in prose (because the user asked follow-up questions in natural language). The prose examples carry weight. The model's priors shift.

**3. Stale state pollution.** The agent made a decision at turn 8 based on facts that were true then. By turn 30, those facts have changed — but the reasoning from turn 8 is still in context, silently influencing everything downstream.

**4. Token budget pressure.** As the context fills toward the model's maximum, less room is left for the response. Output can be cut off when it hits the limit, and in practice the model may shorten its reasoning, cut corners, or produce lower-quality answers as the window fills.
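Two of these failure modes, instruction dilution and token budget pressure, can be roughly quantified from the history alone. The sketch below is one possible approach, not an established metric: `audit_history` is a hypothetical helper, and the four-characters-per-token estimate is a crude assumption that real tokenizers will not match exactly.

```python
import json

def audit_history(history: list[dict], context_limit_tokens: int) -> dict:
    """Rough health metrics for a message history (heuristics, not exact counts)."""
    assistant_turns = [m["content"] for m in history if m["role"] == "assistant"]

    def is_json_object(text: str) -> bool:
        try:
            return isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            return False

    json_turns = sum(is_json_object(t) for t in assistant_turns)
    total_chars = sum(len(m["content"]) for m in history)
    est_tokens = total_chars // 4  # assumption: ~4 characters per token
    return {
        # Share of assistant turns that followed the JSON instruction.
        "json_compliance": json_turns / len(assistant_turns) if assistant_turns else 1.0,
        "estimated_tokens": est_tokens,
        "budget_used": est_tokens / context_limit_tokens,
    }

sample = [
    {"role": "user", "content": "status?"},
    {"role": "assistant", "content": '{"result": "ok", "confidence": 0.9}'},
    {"role": "user", "content": "why?"},
    {"role": "assistant", "content": "Because the build passed."},
]
report = audit_history(sample, context_limit_tokens=1000)
print(report["json_compliance"])  # 0.5
```

A falling `json_compliance` means the history itself now contains counterexamples to the system prompt, which is the dilution described in point 2.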

---

How to detect it

Before applying any fix, confirm you actually have context rot. The simplest test:

```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def test_instruction_following(history: list[dict], probe: str) -> str:
    """
    Send a known-format probe at a given conversation length.
    If the model's compliance rate drops as history grows, you have context rot.
    """
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=256,
        system="CRITICAL: Always respond in valid JSON with exactly these fields: {result: string, confidence: number}",
        messages=history + [{"role": "user", "content": probe}],
    )
    raw = response.content[0].text
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "not_json"
    # Valid JSON that is not an object (a list, a bare string) cannot match the schema.
    if not isinstance(data, dict):
        return "invalid_schema"
    return "valid" if {"result", "confidence"}.issubset(data.keys()) else "invalid_schema"

# Run the
```
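One way to turn a single probe into a measurement is to repeat it at several history lengths and compare compliance rates. The harness below is a sketch under assumptions, not the author's own runner: `compliance_by_length`, `fake_probe`, and the synthetic `recorded` session are hypothetical names, and the stub stands in for the API-backed function so the example runs offline.

```python
from collections import Counter
from typing import Callable

def compliance_by_length(
    history: list[dict],
    probe_fn: Callable[[list[dict], str], str],
    probe: str,
    lengths: list[int],
    trials: int = 5,
) -> dict[int, float]:
    """Return the fraction of 'valid' probe results at each history length."""
    rates = {}
    for n in lengths:
        prefix = history[:n]  # first n messages of the recorded session
        outcomes = Counter(probe_fn(prefix, probe) for _ in range(trials))
        rates[n] = outcomes["valid"] / trials
    return rates

def fake_probe(prefix: list[dict], probe: str) -> str:
    # Stand-in for the real API-backed function: degrades past 20 messages.
    return "valid" if len(prefix) <= 20 else "not_json"

recorded = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "..."}
    for i in range(40)
]
print(compliance_by_length(recorded, fake_probe, "Summarize status.", [0, 20, 40]))
# {0: 1.0, 20: 1.0, 40: 0.0}
```

In real use, `test_instruction_following` can be passed as `probe_fn`, since it takes the same `(history, probe)` arguments. Even lengths keep each prefix ending on an assistant turn, so the probe arrives as a well-formed next user message.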
