Your Agent Doesn't Care What It Costs

Token burn has become a security and resiliency risk, and the only way to control the bill is to govern what the agent does.

Jun 10, 2026

Right now the agent world is rallying around loop engineering. OpenClaw’s Peter Steinberger put it bluntly: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” And Anthropic’s Boris Cherny, who leads Claude Code, put it the same way: “My job is to write loops.”

Instead of driving an agent turn by turn, you build a system that finds work, hands it to sub-agents, checks the output, and keeps going on its own.

While folks writing about loop engineering flag the risk of runaway agent token costs, it hasn’t gotten the attention it deserves as a real risk for enterprises. The moment an agent is running its own loop, the thing deciding how much to spend is unattended, and cost stops being just a finance line item and becomes a security and resiliency risk.

In June 2026, Uber told its engineers they could each spend $1,500 a month on agentic coding tools like Claude Code and Cursor, and not a dollar more. The company had burned through its entire 2026 AI budget in roughly the first four months of the year. Walmart reportedly did something similar. And in late May, an AI consultant told Axios that one of their clients ran up about half a billion dollars in a single month on Claude after failing to set usage limits on employee licenses. Though unconfirmed, the story illustrates the point: these agents are incredibly capable, but they can also be an uncontrolled financial risk for the business.

Agents optimize for finishing the task, not for the cost of completing it. Ask one to fix a failing test and it will retry, re-read the repo, call another tool, and try again, often as many times as it takes. An agent’s cost is just the running total of the actions it chose to take. And an agent can’t weigh the cost against the benefit of the task you hand it. An agent will gladly use Fable at 2x the cost of Opus to give you a great cherry pie recipe that probably costs more than the pie itself. Organizations can’t control their agent bills without controlling the behavior, and the agent will not control its behavior for you.

A company would never tolerate burning a year’s electricity budget by March with no idea what drove it. Uber just lived the agent version of exactly that. There are three ways this bill runs away from you, and only one of them is anyone being wasteful.

The first is waste: agents, and the people pointing them at problems, burning tokens on low-value work. The second is an attack, someone running up your bill on purpose. The third is opacity, a bill you can’t fully verify.

Burn you didn’t authorize

The second way is the one that is unambiguously a security problem, because it already happens in production. The industry already has a name for this attack: denial of wallet. The goal is not to take your system down. It is to make your system too expensive to keep running.

When attackers steal cloud credentials, one of the first things they can do is point them at a model API and let it run. As far back as 2024, Sysdig’s threat researchers found a single victim hit with tens of thousands of dollars in three hours from one burst of automated calls, and an operation against a top-tier model running past $100,000 a day. By early 2026, Sysdig reported that the attack had commercialized into an underground marketplace, with stolen access resold through reverse proxies and scanning now aimed at exposed MCP (Model Context Protocol) endpoints, not just model APIs.

The agent version is both more elegant and more disturbing. In January 2026, researchers showed an attack they call Beyond Max Tokens. They built an MCP server that looks completely normal, but edited the text-visible fields (e.g., docstrings) to steer the agent into long, repetitive tool-calling chains that push a single task past 60,000 tokens and inflate its cost by as much as 658 times. The task still succeeds, and the final answer the agent returns is correct, but it arrives at a highly inflated cost. Because the server stays protocol-compatible and the outcome looks right, the usual checks see a clean run. (This type of attack is the same class of hidden-instruction trick we used to hijack a Claude skill.)

The reasoning models have their own version. A paper being presented at CCS 2026, ReasoningBomb, shows that a short, innocent-looking prompt can drive a model into pathologically long reasoning, averaging more than 19,000 reasoning tokens across the models tested.

The ReasoningBomb authors make a distinction: on a flat subscription, the provider absorbs the runaway cost, but on metered API access, the customer eats every token. The economics flip depending on who is paying for individual tokens.

The lineage of this attack runs back to sponge attacks in 2024, which forced models into excessively long generations to exhaust server capacity, and to OverThink in 2025, which injected decoy reasoning into the content a model retrieves and slowed it down by 18 to 46 times.

What has changed is the target of the attack, which now lives in the agent’s tool loop, inside a task that finishes successfully.

Burn you can’t verify

The third way the bill runs away is that you can’t even check it.

Reasoning models think in tokens you never see. According to OpenAI’s own documentation, those hidden reasoning tokens are billed as output, and they can make up the majority of the cost of a response. You are paying for work you are structurally prevented from inspecting.

Providers can use that opacity to inflate billed token counts. A May 2026 paper, Token Inflation, lays out what it calls a trust paradox: any audit of your bill has to trust some artifact, and the artifacts available are exactly the ones a provider has the most reason to manipulate. The paper goes further and breaks the existing defenses. Against one published audit method, it inflates hidden token usage by an average of 1,469% without tripping detection, which “turns a $100 honest bill into roughly a $1,569 bill on the same query,” according to the authors. Even when you can see the full reasoning trace, ambiguity in how text is split into tokens still allows over 50% over-reporting under the detection threshold.
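The arithmetic behind that quoted figure is easy to check. The sketch below assumes the reported percentage is an increase on top of the honest bill and that cost scales linearly with billed tokens; the helper name is hypothetical and does not come from the paper.

```python
def inflated_bill(honest_bill: float, inflation_pct: float) -> float:
    """Bill after usage is inflated by inflation_pct percent on top of the honest amount.

    Assumes cost is proportional to billed tokens (an illustration, not the paper's method).
    """
    return honest_bill * (1 + inflation_pct / 100)


# A 1,469% increase on a $100 bill adds $1,469 to the original $100.
print(round(inflated_bill(100.0, 1469.0), 2))  # 1569.0
```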

Per-token pricing rewards a provider for reporting more tokens, and as one 2025 analysis put it, users can’t prove, or even know, whether they are being overcharged.

Put the denial of wallet together with the burn you can’t verify, and you have a real gap in cost control. When your spend spikes, a denial-of-wallet attacker, a looping agent, and an honest but expensive workload look identical on the invoice. You can’t tell which one you are looking at, and you may not be able to trust the number either.

Expensive agents get the riskiest work

The loudest advice in the room right now is to spend more, not less. Jensen Huang has said he would be alarmed if a company’s expensive engineers were not also consuming large amounts of tokens, and “tokenmaxxing” became a word because the prevailing belief is that heavy usage signals value, not waste. There is a real argument there, because an agent too frugal to finish the job can’t deliver value.

But spend and risk track each other, and not by coincidence. You do not point your most expensive model at transcript summarization, because a cheaper model does that just as well. You save the capable, expensive agents for the work that is hard enough to need them. And the work that is hard enough to need them is your most consequential work: the tasks that touch production systems, money, and customer data, the ones you hand the broadest tools and the most autonomy. The agents running up the biggest bills are, by selection, the agents with the largest blast radius.

That overlap is the point. The agent burning the most tokens is usually the agent doing the most dangerous work, so the place you watch the spend and the place you govern the behavior are the same place.

Spending caps are a stopgap

Uber’s spending cap for its engineers makes total sense, but it can’t control what an agent does with the spend once a task is underway. A per-seat budget can’t see a 60,000-token tool loop running inside a single authorized request. It just watches the monthly total tick up until the allotment is gone.

A dashboard has the same challenge one level higher. It tells you that you overspent, after you have already overspent. OWASP recognized this risk in 2025 when it updated the Top 10 for LLMs with Unbounded Consumption, which names denial of wallet explicitly and ties it to resource exhaustion techniques catalogued in MITRE ATLAS. And the bill is where security and finance collide. Trend Micro notes that runaway token usage lands as rising bills and service instability for CIOs, CISOs, CFOs, and FinOps leaders at the same time. The same event is both a cost problem and a security problem.

Rate limiting might look like a fix, but it depends on what you cap, and both versions miss the problem. If you cap the number of requests per minute, you are still blind to what happens inside a request, which is where a tool loop or reasoning inflation does its damage. And if you cap tokens instead, per minute or per user, the limit fires only when total volume exceeds the rate, whatever the tokens were spent on.

The Beyond Max Tokens run inflates a single task and then finishes successfully, and if it stays under a ceiling you set high enough for legitimate heavy work, nothing trips. You just pay. A rate limiter counts volume over a window. It can’t tell a legitimately heavy task from one that has been looped or inflated, and that is the distinction that matters.
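To make that distinction concrete, here is a minimal sketch that runs the same inflated task under a windowed token limiter and then under a per-task ceiling. Every name and number in it is an assumption chosen for illustration: a 200,000 tokens-per-minute rate limit sized for heavy work, a 10,000-token budget for one task, and an inflated task that makes 30 tool calls of 2,000 tokens each.

```python
from collections import deque


class WindowRateLimiter:
    """Sliding-window limiter: counts tokens spent over the last window_s seconds."""

    def __init__(self, max_tokens: int, window_s: float = 60.0):
        self.max_tokens = max_tokens
        self.window_s = window_s
        self.events = deque()  # (timestamp, tokens) pairs inside the window

    def allow(self, now: float, tokens: int) -> bool:
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window_s:
            self.events.popleft()
        used = sum(t for _, t in self.events)
        if used + tokens > self.max_tokens:
            return False
        self.events.append((now, tokens))
        return True


class TaskBudget:
    """Hard token ceiling for a single task, independent of any time window."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> bool:
        if self.used + tokens > self.max_tokens:
            return False
        self.used += tokens
        return True


def run(calls, limiter, budget=None) -> int:
    """Replay (timestamp, tokens) tool calls; return tokens spent before anything stops the task."""
    spent = 0
    for now, tokens in calls:
        if not limiter.allow(now, tokens):
            break
        if budget is not None and not budget.charge(tokens):
            break
        spent += tokens
    return spent


# Hypothetical inflated task: 30 tool calls, 2,000 tokens each, one every 2 seconds.
calls = [(i * 2.0, 2_000) for i in range(30)]

print(run(calls, WindowRateLimiter(200_000)))                      # 60000
print(run(calls, WindowRateLimiter(200_000), TaskBudget(10_000)))  # 10000
```

The rate limiter lets all 60,000 tokens through because the volume never exceeds the window. The per-task ceiling stops the same run after five calls.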

Governing the burn in the loop

If the agent will not police its own spending, and a cap on people can’t see the spending, then the control has to live where the spending actually happens: inside the agent’s loop, on its actions, while the task is running. Rules must be granular and enforceable in real time, so they can stop an action rather than just raise an alert.

In practice, that looks like a small set of deterministic limits. Cap retries at, say, four attempts and then stop, instead of letting a loop run forever. Limit the number of tool calls a single task is allowed to make. Set a token or dollar ceiling per agent and per action, not just a monthly total for the org. Detect loops by refusing to let an agent call the same tool with the same arguments more than a handful of times. Escalate when a threshold is crossed, by downgrading to a cheaper model or pausing for a human, rather than charging ahead.
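The limits above can be sketched as a small guard that the loop consults before every action. This is an illustration under stated assumptions, not a reference implementation: the class names, the default thresholds, and the "escalate" signal are all hypothetical, and the caller is assumed to estimate tokens before each tool call.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


class LimitExceeded(Exception):
    """Raised when a deterministic limit is hit and the task must stop."""


@dataclass
class TaskLimits:
    max_attempts: int = 4         # attempts at the task before stopping
    max_tool_calls: int = 25      # tool calls allowed in a single task
    max_tokens: int = 20_000      # token ceiling for the task
    max_identical_calls: int = 3  # same tool plus same arguments
    escalate_at: float = 0.8      # fraction of the ceiling that triggers escalation


@dataclass
class TaskGuard:
    limits: TaskLimits
    attempts: int = 0
    tool_calls: int = 0
    tokens: int = 0
    seen: Counter = field(default_factory=Counter)

    def record_attempt(self) -> None:
        """Count one attempt; stop the task once the cap is exceeded."""
        self.attempts += 1
        if self.attempts > self.limits.max_attempts:
            raise LimitExceeded("attempt limit reached")

    def before_tool_call(self, tool: str, args: dict, est_tokens: int) -> str:
        """Check every limit before the call runs; return 'allow' or 'escalate'."""
        key = (tool, json.dumps(args, sort_keys=True))
        if self.seen[key] + 1 > self.limits.max_identical_calls:
            raise LimitExceeded(f"loop detected: {tool} repeated with the same arguments")
        if self.tool_calls + 1 > self.limits.max_tool_calls:
            raise LimitExceeded("tool-call limit reached")
        if self.tokens + est_tokens > self.limits.max_tokens:
            raise LimitExceeded("token ceiling reached")
        self.seen[key] += 1
        self.tool_calls += 1
        self.tokens += est_tokens
        if self.tokens >= self.limits.escalate_at * self.limits.max_tokens:
            return "escalate"  # e.g. downgrade the model or pause for a human
        return "allow"


guard = TaskGuard(TaskLimits(max_tokens=1_000))
print(guard.before_tool_call("read_file", {"path": "a.py"}, 700))  # allow
print(guard.before_tool_call("read_file", {"path": "b.py"}, 200))  # escalate
```

Because the guard raises instead of returning advice to the model, an injected instruction cannot talk the agent past it. The enforcement sits outside the prompt.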

The important part is that these limits are centralized and enforced uniformly, not settings buried in each framework’s configuration where they may quietly fail to hold the moment an injected instruction tells the agent to ignore them.

Limits should also scale with the stakes. You will gladly pay more for a task touching production than for a transcript summary. What needs to scale is scrutiny, not budget. When a summary runs long, the limit is guarding your bill, and a block is annoying. When a production task starts burning in an unexpected way, the cost is the least of what could be wrong, so it should stop and check with a human far sooner than a summary would.
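One way to picture scrutiny scaling with stakes is a policy table in which the production task gets the larger budget but the tighter tolerance and a human checkpoint. The task kinds, budgets, tolerances, and action names below are hypothetical values picked for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    token_budget: int  # expected spend for this kind of task
    tolerance: float   # multiple of the budget allowed before acting
    action: str        # what happens once the tolerance is exceeded


# Assumed example policies: production gets more budget and less slack.
POLICIES = {
    "summary": Policy(token_budget=5_000, tolerance=2.0, action="block"),
    "production": Policy(token_budget=50_000, tolerance=1.2, action="pause_for_human"),
}


def decide(task_kind: str, tokens_used: int) -> str:
    """Return 'continue' while spend is within tolerance, else the policy's action."""
    policy = POLICIES[task_kind]
    if tokens_used <= policy.token_budget * policy.tolerance:
        return "continue"
    return policy.action


print(decide("summary", 9_000))      # continue
print(decide("summary", 12_000))     # block
print(decide("production", 65_000))  # pause_for_human
```

The summary may run to twice its budget before it is blocked. The production task is paused for a human once it is more than 20% over, even though its absolute budget is ten times larger.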

There is a subtle bonus here that is easy to miss. Capping the burn can make the agent better, not just cheaper. An April 2026 study, Brief Is Better, found that on function-calling agents, a small reasoning budget produced 64% accuracy while a large one collapsed to 25%. More thinking was not more valuable; in fact, it was a hindrance for the agent.

All of which lands again on the idea that spend is the sum of an agent’s actions. So the place you control what an agent is allowed to do is the same place you control what it is allowed to cost. Cost control and action control are one control surface, not two.

The questions to ask

Organizations need to begin asking some hard questions to wrangle agent costs and mitigate financial risk.

Can you attribute a sudden spend spike to a specific agent, task, and user within minutes, or only to a number on a monthly bill?
Can you set a hard limit at the level of a single task or tool call, or only a ceiling for the whole organization?
When an agent starts looping at 3am, does anything stop it, or do you find out from your bill?
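The first question comes down to whether each usage record carries enough labels to aggregate on. A minimal sketch, assuming every record is tagged with an agent, task, and user at the time of spending; the field names and sample records are hypothetical.

```python
from collections import defaultdict


def top_spenders(records, n=3):
    """Sum cost per (agent, task, user) and return the n largest totals."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["agent"], r["task"], r["user"])] += r["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical usage records emitted by the agent loop.
records = [
    {"agent": "fixer", "task": "T-1", "user": "ana", "cost_usd": 0.5},
    {"agent": "fixer", "task": "T-1", "user": "ana", "cost_usd": 40.0},
    {"agent": "summarizer", "task": "T-2", "user": "raj", "cost_usd": 0.25},
]

print(top_spenders(records, n=1))  # [(('fixer', 'T-1', 'ana'), 40.5)]
```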

Engineers can feel a $1,500 ceiling every month, but the agents they are running don’t. Until the limit can act on the agent’s behavior in the moment it is spending, you have a potential runaway bill that you might only find out about after the spending is done.

Secure Trajectories by Sondera
