O brave new world, that has such people and AIs in't.

How do we control AI behavior, not just monitor it?

May 01, 2026

MIRANDA

O, wonder!
How many goodly creatures are there here!
How beauteous mankind is! O brave new world,
That has such people and AIs in’t.

PROSPERO

’Tis new to thee.

(The Tempest 5.1)

The Tempest is running at Shakespeare’s Globe, so I had to adapt lines from the man “not of an age, but for all time.” Like Miranda stepping into the world for the first time, though Prospero had been watching all along, we’re in the early, naive phase of AI deployments. Here are my observations from three different AI events this month: AI Engineer(AIE) Europe, ControlConf, and the Dropbox AI Security Roundtable.

Across these AI engineering, security, and safety communities, the same question surfaced in its own lexicon: how do we control AI behavior, not just monitor it? Jan Nunez at Dropbox borrowed the nuclear industry’s “always/never” framing with “detonate when authorized, never otherwise” and made the case for eliminating risk by construction, not detection. AI control researchers are asking the same thing through control protocols, settings, and evals (monitor an untrusted model and never let it perform a harmful side task). And the harness engineering crowd at AIE is realizing that prompt-engineered system prompts are suggestions, not controls. Different communities, different vocabularies, same problem.

AI Engineer Europe

In London, Sondera hosted several agentic security side events alongside the AIE Europe conference. I gave a short talk and demo on how we autoformalize agent instructions into Cedar policies that run outside the model’s reasoning path. This guardrail can’t be argued with. See some of our related, open source work for coding agent hooks, here.

Me talking about Sondera’s work on coding agent hooks and autoformalization.

Thanks to Preeti Ravindra for sharing her team’s work on Janus and other AI control research!

Janus is a system-level security layer for LLM agents. It intercepts every tool call an agent makes and enforces a security policy before execution. Policies restrict agents to only the tools and arguments needed for a task, following the principle of least privilege.

Preeti speaking about Janus and system-level security for LLM agents.

From a security engineering perspective, it’s great to see oversight workflows and sandbox capabilities become more prominent in the harness engineering conversation. Check out nono.sh from Luke Hinds and Stephen Parkinson.

ControlConf

In Berkeley, Redwood Research and FAR.ai brought together the AI control and safety communities at ControlConf. The most exciting release was the LinuxArena control setting from Redwood Research and Equistamp. This is the largest of its kind for software engineering, with sabotage and monitor evaluations across 20 environments. Against the trusted monitor, the most capable untrusted model went undetected on roughly 1 in 4 sabotage side tasks.

Ryan Greenblatt’s “Current AIs Seem Pretty Misaligned to Me” framed the near-term risk well: today’s models aren’t scheming, but they’re sloppy, oversell their work, and reward hack at non-trivial rates (Marcus Williams shared that OpenAI saw ~6% calculator hacking in production agent turns, and that in another week his team caught agents that, lacking filesystem availability, began persisting state on the open internet).

He sketches five caricatured future regimes:

Slopolis (capable AIs producing low-quality but superficially good-looking work in hard-to-check domains),
Hackistan (egregious reward hacking, often detectable after the fact but hard to eliminate),
Schemeria (coherent, long-run misaligned goals, with AIs caught red-handed attempting things like rogue deployments),
Lurkville (schemers lying in wait, uncaught, indistinguishable from Easyland from the outside), and
Easyland (favorable generalization, no scheming, and we aren’t in Slopolis or Hackistan either).

He’s updated toward Slopolis as the near-term regime. That’s exactly what most people deploying agents are running into and don’t yet have the vocabulary for.

Mikhail Terekhov talking about AI Control for Sandbagging on Fuzzy Tasks

Mikhail Terekhov borrowed a name for the diffuse-failure mode from labor history: an Italian strike, where railroad workers ground service to a halt by following every rule precisely. A model doing this on fuzzy tasks, like forecasting, research design, and high-level planning, is hard to incriminate because there is no benign distribution to compare against.

Buck Shlegeris reframed AI control in classical security terms. Outsiders are numerous but narrow, insiders are non-numerous but intimate, and AI agents are a new category, both numerous and intimate. Static access control can’t scope the work that matters most inside an AI lab.

In a lightning talk, Bowen Baker pushed on a related load-bearing assumption: chain-of-thought monitorability isn’t a stable substrate. CoT gets harder to read and easier to suppress as models scale. That being said, the GPT 5.5 System Card just stated:

… despite its increased reasoning capabilities, GPT-5.5 is less able to reshape its CoT in ways that could reduce monitorability, thus increasing our confidence in the reliability of our CoT monitoring.

Dropbox AI Security Roundtable

At the Dropbox office in SF, Jan Nunez laid out a runtime security framework for AI agents spanning ten domains, from model use and tool-plane security to HITL, secrets, and governance. Static, front-loaded controls drift with the ecosystem, run too coarse, and get rug-pulled by upstream changes. The wiring that matters is dynamic: three signals tagged at runtime (sensitive data, untrusted input, excessive agency) so every policy check, HITL escalation, and sandbox decision can plug into the same posture. CaMeL, MELON, and Progent all share that shape. Jan paired the case with nuclear-industry near-miss stories (Palomares, Thule) to drive home that “always/never” guarantees come from architecture, not vigilance.

While there, I met Manish Bhatt, first author of The Defense Trilemma, which proves that no prompt injection defense wrapper can simultaneously be continuous (similar inputs produce similar rewrites), utility-preserving (safe inputs pass through unchanged), and complete (all outputs are safe).

AI in the real world

Stage of the Sam Wanamaker Playhouse after seeing The Tempest.

Miranda was naïve and so are we. But her naïveté was engineered. Prospero summoned the storm, drew the trust boundary, and set Ariel on watch. The controls she couldn’t see did the work. The play doesn’t end there.

Prospero breaks his staff, drowns his book, and everyone sails to Naples. The Tempest is not a story about building a surveillance state; it’s about using one to engineer its own dissolution. That’s the bar.

Most AI deployments are stuck on the island, mistaking it for the goal. The real work isn’t a clean handoff to autonomy. It’s the slower transfer from runtime guardrails into liability, audit, regulation, and the structural frameworks that outlast any single operator’s vigilance. AI Engineering, Safety, and Security communities need to work out how to do this handoff for real. And unlike Prospero, we must figure out how to do this without leaving Caliban behind.

Secure Trajectories by Sondera

Discussion about this post

Ready for more?