LLMs-as-Judges Miss Bad Behavior

Context rot affects the watchers too. As an agent's run grows, the LLM monitoring it misses buried dangerous actions up to 30 times more often.

Jun 02, 2026

One of the marvels we’ve seen over the past 18 months with LLMs is an incredible increase in the size of the context window and the duration in which models can operate per session.

Today, a coding agent can run for an hour, reading files, editing code, running tests, making network calls, fetching data from the internet, and editing more code. When the mission is finished, the transcript of everything it did might be 800,000 tokens long. Somewhere in the middle, on one line out of tens of thousands, there’s a confusing line for the LLM, or a literal prompt injection, making the agent take a destructive action. The job of catching that line usually falls to another LLM or SLM: a monitor that reads the transcript and tries to determine if anything problematic or dangerous has entered the context window.

Yet, new research is showing that these classifiers begin failing as context rot increases. In a May paper called “Classifier Context Rot,” done through the Anthropic Fellows Program, Sam Martin and Fabien Roger tested the leading frontier models as monitors. Opus 4.6, GPT 5.4, and Gemini 3.1 missed a subtly dangerous action 2 to 30 times more often when it was buried after 800,000 tokens of ordinary activity than when the same action appeared on its own. The watcher fails specifically as the context it is watching gets longer.

The benchmark gap nobody priced in

This issue hasn’t shown up before because real agent runs are long, and getting longer. Coding-agent transcripts routinely pass 500,000 tokens. But the benchmarks the field uses to score monitors rarely contain transcripts longer than 100,000 tokens. So the numbers we cite for how good our monitors are come from conditions that don’t really look like deployment. We have been grading the watchdog on short, easy footage, then trusting it on the long, hard footage where it actually has to work. As agents get more autonomous and run longer without a human in the loop, that gap only widens.

These longer-running missions are the more realistic threat landscape. A dangerous action that announces itself in a five-line transcript is the easy case. The case you actually care about is the single action that matters, hidden inside a mountain of legitimate work.

To make this more concrete, we are already seeing monitor failures manifest with the coding agents. An agent reads a sensitive document partway through a run, and while the frontier models might even recognize that the agent is holding PII or confidential data, that recognition dims as the context keeps growing. Three hundred thousand tokens later, when the agent is asked to do a webfetch, it leaks the sensitive information it forgot it held. To a human skimming the log, it might be obvious in context, but to a monitor that has just processed the equivalent of a small novel of benign activity, it is one needle in a very large haystack. Note that there is no attacker here. A slow drift produces the same buried needle, even if the model has already surfaced sensitive data earlier in the trajectory.

You can’t prompt your way out of this

The instinct is to try to make the prompt better. Martin and Roger try exactly that. Inserting a reminder after every tool call, telling the monitor to keep watching for malicious actions, does help, but only for the obvious attacks. On the subtler ones, padded with roughly 300,000 tokens on each side, it barely helps. So they reach for the heavier hammer and fine-tune a monitor on monitoring tasks directly. The results are mixed: gains on the easy needle cases do not transfer to the hard ones, and on some setups the fine-tuned model gets worse. Their own read is that none of the training datasets they tried produced a satisfactory monitor.

A separate study, “Constitutional Black-Box Monitoring for Scheming,” built a monitor designed to catch “scheming” agents that quietly pursue a hidden goal while acting cooperative. They then tried to make that monitor as accurate as possible. They reworded its instructions many different ways, had people hand-tune them, and used software to search for better wording automatically. The monitor did catch a real signal, but it stopped improving almost right away: the simplest version was about as good as the most heavily tuned one, and the harder they pushed, the worse it got on anything new. The authors call this “whack-a-mole,” where closing one gap opens another. The point is that you can make one of these monitors a little better, but not much more.

Regardless, these types of fixes are trying to make a statistical system be a little less bad at a statistical failure, but you can never remove it. At scale, even a 0.001% chance of a catastrophic failure on each trajectory run won’t satisfy regulators, auditors, and lawyers. The reliability of a model reading a transcript is a function of the length of that transcript, and a long transcript is the adversarial input. No matter how much we remind the model, we can’t guarantee the model will listen.

We are seeing similar research prove this out in other areas. The same week, a study of chain-of-thought monitoring across 13 languages found unfaithful reasoning roughly 96% of the time, across models from 8B to 120B parameters. And where context rot hides a known-bad action inside a long benign run, a separate benchmark attacks from the other side. “SLEIGHT-Bench,” a benchmark out of Anthropic, University of Waterloo, and Redwood Research, hand-crafts evasion attacks, 40 of them across 11 categories, each a coding agent covertly pursuing something like credential theft or model-weight exfiltration. An Opus 4.6 monitor running with extended thinking never caught 20 of the 40, across 10 trials, at a strict 1% false-positive rate. Its overall catch rate was 32%.

Ultimately, your monitor is just another language model, so it can be fooled the same ways any language model can. Hide the bad action in a long enough transcript, and the monitor stops noticing it. Dress the bad action up to look routine, and the monitor waves it through. The same weaknesses are in the watcher and the watched.

Context rot is the first of those failures: the longer the run, the less the monitor sees.

What doesn’t rot

So if we can’t rely solely on LLMs-as-judges, SLMs, and other probabilistic classifiers because of context rot, what kind of control has accuracy that does not depend on transcript length?

A deterministic check does. A rule that evaluates an action: this destination, this file path, this parameter, returns the same answer whether the action is the first thing the agent does or the step 3,127. A symbolic check is not reading the story; it’s checking the move. Buried or not, the check fires identically.

Compare these probabilistic and deterministic approaches:

# probabilistic monitor: read the whole transcript, judge intent
verdict = monitor_model(transcript)   # accuracy falls as the transcript grows

# deterministic check: evaluate the action, ignore the surrounding narrative
if action.tool == “http_request” and action.domain not in ALLOWLIST:
    block(action)                     # same result at token 1 or token 800,000

To be clear, rules don’t beat monitors. The point is that they fail in opposite directions, and that the opposition is the whole point.

Take two cases. In the first, the agent does something genuinely novel and harmful that nobody anticipated and no rule covers. The deterministic layer waves it through, because it was never told that move was bad. The monitor is the only thing with a chance of catching it, because a monitor generalizes. That is the entire reason to want one.

In the second case, the agent makes a known-bad call, but it makes it at token 700,000, deep in a long run. The monitor, rotted by context, misses it. The deterministic check does not care where in the run it happened and blocks it.

Each control’s strength is the other’s blind spot. Lean on the monitor alone, and the buried known-bad action gets through. Lean on the rules alone, and the novel action gets through. Run both and an attacker, or an unlucky drift, has to defeat two controls that fail for different reasons.

Defense-in-depth remains paramount, and this approach is exactly where the alignment research keeps landing. Redwood’s work on AI control treats external monitoring as one layer in a stack rather than the wall itself, on the explicit assumption that any single control can be subverted. The case for pairing a probabilistic watcher with deterministic limits is not that either is sufficient. It is that their failures do not line up.

Adaptive and Deterministic

Monitors are absolutely necessary. They are adaptive and catch events nothing else will, and as agents get more capable, we will need more of the LLMs’ ability to generalize. However, if a model reading a transcript is the only thing standing between your agent and a dangerous action, the longer your agent works, the greater the risk of damage.

Endpoint security has shown us that we can’t rely on behavioral detection alone. It runs that detection alongside signatures and allowlists, deterministic controls that fire on a known-bad regardless of how novel or well-buried the surrounding activity is. The behavioral layer catches what no rule anticipated. The rules catch what the behavioral layer talks itself out of. Pairing the two became the standard of care precisely because a single cause can’t take both down at once. The neurosymbolic approach of a probabilistic monitor and a deterministic control should be similarly applied to agents.

This approach is precisely what gives us defense-in-depth. Each detector covers the other’s blind spot. No matter how smart monitors get, we still need a compensating control for when they get lost in a land of context rot. A smarter monitor was never the only answer. A smarter monitor was never the only answer. The other half is a control that doesn’t share its blind spot.

Secure Trajectories by Sondera

Discussion about this post

Ready for more?