The Agent PB&J Problem
The lethal trifecta is not just a story about prompt injection. It is a story about literal execution.
You’ve done this exercise. Maybe in elementary school, maybe a parent did it to you, maybe you’ve done it to your own kids. You write instructions for making a peanut butter and jelly sandwich, and someone follows them literally. When you say, “Put the peanut butter on the bread,” your dad puts the peanut butter jar on top of an unopened bag of bread. The dad joke lesson is that the gap between what you say and what you mean can be enormous, and most of the time you have no idea the gap is there.
Humans fill this gap with common sense, detailed checklists, and double-checking each other. For agents, though, the gap remains wide open.
This PB&J “exact instruction” exercise captures the ultimate challenge with agents. How do you know they will solve the problem the way you really intend?
Agents can follow the rules, but not the intent
The PB&J exercise survives across generations because it teaches something irreducible. Humans don’t realize how much shared context and common sense we operate on. Of course we know to open the bag, and open the jar, even if we haven’t made a sandwich before. None of those steps are in our cookbooks. All of them are in our heads. If we followed the rules as written, we’d all struggle to communicate and coordinate.
Agents, on the other hand, don’t have the same kind of common sense and ethics that humans have. They are prone to reward hacking, and literal execution of instructions is exactly what they’re trained to do. This part of the agent security conversation has been underweighted.
Most of the public discussion has converged on prompt injection. Someone slips a malicious instruction into a document, an email, or a web page, the agent ingests the instruction, and now the agent is doing something the user did not ask for. Simon Willison’s framing of the lethal trifecta (an agent that combines exposure to untrusted input, access to private data, and the ability to take external action) is the cleanest articulation of why prompt injection is dangerous. When all three are present, as they are in a typical coding agent, you are exposed to data exfiltration or destruction.
So far, the conversation has put almost all the weight on untrusted input: the attacker and the injection. The risk that someone tricks the agent into misbehaving is real.
But the PB&J problem says the harder question is whether the agent will misbehave when nobody is attacking it. The user asks for a sandwich. The agent has authorization to access the kitchen, authorization to handle the jars, and authorization to use the knife. The agent executes literally and might smash open the peanut butter jar instead of unscrewing the lid. Nobody attacked anything, and the agent did exactly what it was told. The result is still a very messy kitchen.
The accident surface
Delete the test data. Reconcile the ledger. Send the report to the team. Each instruction has a literal interpretation an LLM will cheerfully execute, and in some cases the literal interpretation can be catastrophic. “Delete the test data” deletes prod, because the agent didn’t know which database the user meant. “Reconcile the ledger” posts to the wrong account. “Send the report to the team” sends it to a Slack channel with a customer in it.
None of the failures above are prompt injection. They are PB&J: an authorized agent with authorized tools, given ambiguous instructions and executing them literally. The attack surface and the accident surface are the same.
The lethal trifecta has three legs. The conversation has focused on untrusted content and prompt injection because controlling the agent at the prompt, with instructions, is the natural first move.
However, we now have to assume prompt injection will happen, because the leg that does the actual damage is the third one. External action is where the consequences live. An agent with bad instructions and no tools can only generate text, and text by itself is more containable than behavior. Whether the bad instruction came from a malicious payload in a PDF or from a user who forgot to mention which database, the agent acts the same way and the blast radius is the same.
The lethal trifecta is a story about how authorized capability plus ambiguous intent equals failure, and the attacker becomes just one source of ambiguous intent among many. The user is another source. The LLM itself is another. A system prompt is a fourth. The agent doesn’t know which source produced which instruction. The agent just executes.
Just write a better prompt
Can prompt engineering save us here? Write a better system prompt. Build a 30,000-line CLAUDE.md. Add the missing context. Enumerate the edge cases. An instruction file is a lot of work, but the work is doable, and once it’s done the agent has much more of the context it needs to make a PB&J.
Better prompts and improvements in the models do help and close some of the gap. The question is whether better prompts can scale across an enterprise.
Consider what scaling in the enterprise actually demands. Every team in the company runs its own agents, with their own tools, for their own workflows. Every workflow has its own unstated assumptions about which database is prod, which channel is internal, which ledger is active, which patient is confirmed. Every employee writing a prompt has to know all of those assumptions and remember to spell every one of them out, every time, for every agent. Then the model changes. The provider updates the system prompt. A new tool gets added to the agent’s belt. The prompt that worked last quarter no longer covers the failure modes the new behavior introduces. The instruction file has to be rewritten. By whom? Reviewed by whom? Across how many teams?
The prompt engineering answer asks every employee to become the kid at the kitchen counter writing 1,500 steps for a sandwich. And even after 1,500 steps, the kid is not sure the dad will follow the rules the way the kid meant. The dad reads step 847 and decides step 847 doesn’t apply this time. The dad invents an interpretation of step 1,203 the kid did not anticipate. The dad is a literal executor, and a literal executor with 1,500 steps has 1,500 surfaces for a literal misreading. More detailed instructions are not the full answer.
And of course, at the end of the day, prompts are just suggestions. We call this approach “prompt and pray.” No matter how long or detailed your instructions are, there are no guarantees the model will follow your rules or intent.
Sandwiches are sequences
The PB&J exercise also teaches us that many mistakes are sequential. Open the bag, then take out the bread, then open the peanut butter, then use the knife on the peanut butter, then transfer it to the bread. Order matters.
In addition to data leakage and destruction, agent failures can be sequence failures. Post the medication request before confirming the patient. Move the file before checking what depends on it. Send the email before re-reading the thread. Transfer the funds before verifying the recipient. Each example is an action that would be fine in the right order and is harmful in the wrong order. The agent had permission to post the medication request. The agent had permission to move the file. The agent had permission to send the email. The permissions are not the problem. The order of operations is the problem.
And no matter how much you prompt an agent to ALWAYS DO STEP 4 BEFORE STEP 35!!!, the agent can easily do it in a different order or skip a step altogether.
The formal methods community has been thinking about ordering problems for a long time. Linear temporal logic is one formalism for reasoning about properties that hold across sequences, such as “This action is permitted only if these prerequisites have happened first” or “This state must never recur.”
A stateless authorization check (does this principal have permission to do this action right now) can’t express either property. Most current agent guardrails are stateless authorization checks, which means most current agent guardrails can’t catch this failure mode.
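To make the difference concrete, here is a minimal Python sketch. Everything in it is hypothetical: the permission table, the action names, and the two helper checks are invented for illustration, and “never recur” is read here as “appears at most once.” It is not how any particular guardrail product works, just one way to see what a per-call check cannot see.

```python
from typing import Iterable

# Hypothetical permission table: (principal, action) pairs that are allowed.
PERMISSIONS = {("agent", "confirm_patient"), ("agent", "post_medication_request")}


def stateless_allow(principal: str, action: str) -> bool:
    """Answers only: may this principal perform this action right now?"""
    return (principal, action) in PERMISSIONS


def precedence_holds(trace: Iterable[str], action: str, prerequisite: str) -> bool:
    """True if every occurrence of `action` comes after some `prerequisite`."""
    seen_prerequisite = False
    for step in trace:
        if step == prerequisite:
            seen_prerequisite = True
        elif step == action and not seen_prerequisite:
            return False
    return True


def never_recurs(trace: Iterable[str], state: str) -> bool:
    """True if `state` appears at most once in the trace."""
    return sum(1 for step in trace if step == state) <= 1


bad_trace = ["post_medication_request", "confirm_patient"]
good_trace = ["confirm_patient", "post_medication_request"]

# Every individual step in the bad trace passes the stateless check...
every_step_permitted = all(stateless_allow("agent", a) for a in bad_trace)

# ...but only a check over the whole trace can tell the two orderings apart.
bad_order_ok = precedence_holds(bad_trace, "post_medication_request", "confirm_patient")
good_order_ok = precedence_holds(good_trace, "post_medication_request", "confirm_patient")
```

The stateless function has no argument through which history could even reach it. The two trace checks need the whole sequence, which is exactly the input most guardrails never receive.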
Encoding sequence into a guardrail is an open problem. Cedar, the policy language we’ve been spending time with, is stateless at the core but supports entity stores that can carry trajectory state across turns. We can approximate stateful authorization by updating entity attributes as the agent acts so subsequent calls can understand the current state. This approach works in many cases, but it doesn’t give you native sequence reasoning. Most authorization languages are stateless because authorization in traditional software is naturally request-response. Web apps don’t need to reason about whether a previous request happened before the current one. Agents do, and sequence remains an unsolved problem.
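The attribute-updating workaround can be sketched in plain Python. This is a simulation of the idea, not Cedar’s API: the `Entity` class, the `store` dictionary, the `confirmed` attribute, and both functions are assumptions made up for the example. The rule itself stays stateless. The history lives in the data the rule reads.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Stand-in for an entity record in a policy engine's entity store."""
    uid: str
    attrs: dict = field(default_factory=dict)


# Hypothetical entity store keyed by entity id.
store = {"Patient::42": Entity("Patient::42", {"confirmed": False})}


def is_authorized(action: str, resource_uid: str) -> bool:
    """Stateless rule: looks only at the attributes currently in the store."""
    resource = store.get(resource_uid)
    if resource is None:
        return False  # unknown resource: deny
    if action == "confirm_patient":
        return True
    if action == "post_medication_request":
        return resource.attrs.get("confirmed") is True
    return False  # default deny for anything not listed


def record_effect(action: str, resource_uid: str) -> None:
    """After an allowed action runs, write its effect back as an attribute."""
    if action == "confirm_patient":
        store[resource_uid].attrs["confirmed"] = True


before = is_authorized("post_medication_request", "Patient::42")
if is_authorized("confirm_patient", "Patient::42"):
    record_effect("confirm_patient", "Patient::42")
after = is_authorized("post_medication_request", "Patient::42")
```

The limits show up quickly. Each ordering rule needs its own hand-maintained attribute, and something outside the policy has to remember to call `record_effect` every time.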
One direction the research is heading is what Xiang et al., in the paper “Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks,” recently called system-level defenses: architectures where the agent submits a plan of intended actions before any tool is called, and an external policy engine evaluates the whole sequence before any of it executes. The plan becomes the artifact you can reason about, instead of trying to catch each action in isolation.
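A rough sketch of the plan-first idea, in Python. This is our own illustration and not the architecture from the paper: the `Step` type, the `REQUIRES_EARLIER` rule table, and both functions are hypothetical, and a real policy engine would evaluate far richer rules than “this tool must follow that tool on the same target.”

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    target: str


# Hypothetical ordering rules: tool -> tool that must run earlier on the same target.
REQUIRES_EARLIER = {
    "transfer_funds": "verify_recipient",
    "move_file": "check_dependents",
}


def evaluate_plan(plan: list[Step]) -> list[str]:
    """Return a list of violations; an empty list means the plan may run."""
    violations = []
    done: set[tuple[str, str]] = set()
    for index, step in enumerate(plan):
        needed = REQUIRES_EARLIER.get(step.tool)
        if needed is not None and (needed, step.target) not in done:
            violations.append(
                f"step {index}: {step.tool} on {step.target} requires earlier {needed}"
            )
        done.add((step.tool, step.target))
    return violations


def run_plan(plan: list[Step], execute) -> bool:
    """Execute nothing unless the whole plan passes evaluation."""
    if evaluate_plan(plan):
        return False
    for step in plan:
        execute(step)
    return True


executed: list[Step] = []
rejected = run_plan(
    [Step("transfer_funds", "acct-1"), Step("verify_recipient", "acct-1")],
    executed.append,
)
accepted = run_plan(
    [Step("verify_recipient", "acct-1"), Step("transfer_funds", "acct-1")],
    executed.append,
)
```

The first plan contains the same two steps as the second, in the wrong order, and none of it runs. Rejecting the whole plan up front avoids the half-finished state you get when a per-action check stops an agent midway.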
While sequence remains an unsolved problem in production, research like this will help the answer come into focus.
Fix the kitchen, not the cook
So how do we wrangle these agents so we can trust them to follow our intent?
One approach has been to try to put “common sense” inside the model. Train the model on more data. Fine-tune the model on more examples. Run alignment techniques. Wrap the model in a more sophisticated agent harness. Prompt the model more carefully. Hope the next model is smart enough to fill in the gaps the current one misses.
This approach is necessary, but not sufficient. The PB&J exercise tells us exactly why prompt and pray fails. If you write 1,500 steps to make a sandwich, a dad will still find a way to skip one, do it out of order, or misinterpret the intent. There is no natural language instruction you can write that closes the gap, because the agent treats instructions as suggestions.
The other half of the approach is to assume the agent is misaligned and put the controls in the environment in which it’s operating. Not in the cook. In the kitchen.
Security engineers know the principle of least privilege. A process should only have the permissions it needs to do its job, nothing more. Agents need a parallel “principle of least autonomy.”
An agent should not be allowed to break anything, even when the agent has concluded that breaking the thing is a good idea. Despite an agent’s confidence in its own reasoning, it should not always be able to act on its reasoning. Least privilege says you are only allowed to access the bare minimum to get the job done. Least autonomy says you can’t do what you have not been authorized to do right now, in this context, against these objects, in this order, regardless of how convinced you are it would help.
The kitchen knows the jar has to be open before the knife goes in. The kitchen refuses, mechanically, to let the knife touch the closed jar. The cook can be as literal as the cook wants, as creative as the cook wants, as convinced as the cook wants that smashing the jar open is the faster path. The result is still a sandwich because the environment will not permit any sequence of actions that doesn’t produce one.
For agents, this behavioral approach means the tools have to know things the agent doesn’t. The medication-posting tool has to know a patient must be confirmed first. The file-moving tool has to know which references it would break. The email-sending tool has to know whether the thread has been read. The agent does not need to be smarter. The tools need to be smarter about the agent. The policy layer enforcing the order of operations needs to live outside the model and outside the context window, and it needs to be deterministic and fail-closed.
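Here is what a kitchen that refuses might look like as a small Python sketch. The `Kitchen` class, its tool names, and the `PolicyDenied` exception are all invented for the example. The point it tries to show is that the precondition lives in the tool and in state the agent cannot edit, so no amount of confident reasoning changes the outcome.

```python
class PolicyDenied(Exception):
    """Raised when the environment refuses an action."""


class Kitchen:
    """Holds the state the tools rely on; the agent only reaches it via call()."""

    def __init__(self) -> None:
        self.jar_open = False
        self.bread_spread = False

    def open_jar(self) -> None:
        self.jar_open = True

    def spread_peanut_butter(self) -> None:
        # The tool checks its own precondition, whatever the agent believes.
        if self.jar_open is not True:
            raise PolicyDenied("jar must be opened before the knife goes in")
        self.bread_spread = True

    def call(self, tool_name: str) -> str:
        """Dispatch a tool call; unknown tools and unmet preconditions are denied."""
        allowed = {
            "open_jar": self.open_jar,
            "spread_peanut_butter": self.spread_peanut_butter,
        }
        tool = allowed.get(tool_name)
        if tool is None:
            return "denied: unknown tool"
        try:
            tool()
        except PolicyDenied as reason:
            return f"denied: {reason}"
        return "ok"


kitchen = Kitchen()
too_early = kitchen.call("spread_peanut_butter")
creative = kitchen.call("smash_jar")
opened = kitchen.call("open_jar")
in_order = kitchen.call("spread_peanut_butter")
```

There is no `smash_jar` tool to call, and the knife is refused until the jar is open. The only sequence that reaches a spread sandwich is the one the kitchen allows.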
Fail-closed doesn’t necessarily mean failing silently. When the kitchen refuses an action, the right move is often to surface the decision to a human: “I am about to delete the test database, but I can’t verify whether it has production dependencies. Proceed?” The agent doesn’t get to decide. The human does. The same “Architecting Secure AI Agents” paper cited above makes this case directly, treating human interaction as a core design consideration rather than a fallback.
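A minimal sketch of that escalation path, assuming a hypothetical `guarded_delete` wrapper. The dependency check, the human prompt, and the delete operation are passed in as callbacks, and the stubs below stand in for real systems. Anything short of an explicit “yes” from the human leaves the database alone.

```python
from typing import Callable, Optional


def guarded_delete(
    database: str,
    has_prod_dependencies: Callable[[str], Optional[bool]],
    ask_human: Callable[[str], bool],
    delete: Callable[[str], None],
) -> str:
    """Delete only when the check passes or a human explicitly approves."""
    verdict = has_prod_dependencies(database)  # True, False, or None if unknown
    if verdict is True:
        return "denied: production dependencies found"
    if verdict is None:
        question = (
            f"I am about to delete {database}, but I can't verify whether "
            "it has production dependencies. Proceed?"
        )
        if ask_human(question) is not True:
            return "denied: human did not approve"
    delete(database)
    return "deleted"


# Stub callbacks so the sketch runs on its own.
deleted: list[str] = []
questions: list[str] = []


def unknown(_: str) -> Optional[bool]:
    return None  # the dependency check could not reach a verdict


def decline(question: str) -> bool:
    questions.append(question)
    return False


def approve(question: str) -> bool:
    questions.append(question)
    return True


declined = guarded_delete("test_db", unknown, decline, deleted.append)
approved = guarded_delete("test_db", unknown, approve, deleted.append)
```

The agent’s own confidence never appears as an input. The only things that can unlock the delete are a clean verdict from the dependency check or a human saying yes.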
The PB&J instruction problem is one of our most human examples of non-determinism, and we need to heed its lesson now that we’ve started talking to machines in English. The machines are listening just as literally as they ever did, and it’s up to us to make sure that an offhand command like “make a PB&J sandwich” doesn’t turn into a painful lesson on the gap between language and intent.


