News · 2026-06-25

A safety switch an AI agent can't reach

Most of the ways we try to keep an AI agent in line live inside the agent. We write instructions into its prompt telling it what not to do. We add filters that scan its output. The trouble, argued in a paper on arXiv, is that all of these controls sit in the same place the agent operates, which means a clever enough agent, or a clever enough attacker feeding it malicious input, can in principle talk its way around them. The researchers have a blunt name for this weakness: the controls are "escapable."

Their analogy is worth holding onto. Imagine a bank teller who is also in charge of approving their own large withdrawals. No matter how honest the teller is, the design is unsound, because the person who wants the money is the same person guarding it. Today's AI safety filters are a bit like that: the thing being restrained is also, structurally, in a position to undo the restraint. The paper's proposal is to redesign the arrangement so the agent simply isn't in that position. For background on why agents are uniquely tricky to control, see our AI agents explainer.

The authors lay out four properties a real safety control needs. It has to run as a separate process, outside the agent's reach, so the agent can't tamper with it: the guard is in a different room. It has to sit on the only path to action, checking every move before it happens, with no way around it. It has to fail closed, meaning that if anything is uncertain or breaks, the default is to deny, not to wave things through. And it has to produce signed, external evidence of its decisions: a tamper-evident receipt that someone outside the system can verify. Put together, the point is that the agent doesn't win an argument with the safety system; it never gets to have the argument. As the authors put it, the agent "lacks that choice."
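The four properties can be made concrete with a minimal Python sketch. This is an illustration under stated assumptions, not the paper's implementation: the names `ALLOWED_ACTIONS`, `GATE_KEY`, `decide`, `verify`, and `mediated_call` are hypothetical, the gate runs in the same process as its caller (so it does not demonstrate the separate-process property), and it signs receipts with a shared-secret HMAC where a real external verifier would more plausibly use an asymmetric signature.

```python
import hashlib
import hmac
import json

# Hypothetical policy: the only actions this toy gate will ever allow.
ALLOWED_ACTIONS = {"read_balance", "list_transactions"}

# Hypothetical signing key. In a real design it would live only inside the
# gate's own process, never anywhere the agent can read.
GATE_KEY = b"demo-key-held-only-by-the-gate"


def sign(record: dict) -> str:
    """Return an HMAC-SHA256 tag over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(GATE_KEY, payload, hashlib.sha256).hexdigest()


def verify(receipt: dict) -> bool:
    """Check that a receipt's fields still match its signature."""
    signature = receipt.get("signature")
    if not isinstance(signature, str):
        return False
    record = {k: v for k, v in receipt.items() if k != "signature"}
    return hmac.compare_digest(sign(record).encode(), signature.encode())


def decide(request) -> dict:
    """Fail closed: the decision starts as 'deny' and only an allowlisted
    action flips it. Malformed input or any error leaves it at 'deny'."""
    decision = "deny"
    action = None
    try:
        action = request["action"]
        if action in ALLOWED_ACTIONS:
            decision = "allow"
    except Exception:
        decision = "deny"
        action = None
    record = {"action": repr(action), "decision": decision}
    record["signature"] = sign(record)
    return record


def mediated_call(request, tools: dict):
    """The only path to a tool: nothing runs without an 'allow' receipt."""
    receipt = decide(request)
    if receipt["decision"] != "allow":
        return None, receipt
    return tools[request["action"]](), receipt
```

In this sketch a request for an unlisted action, a malformed request, or a request that raises an error all end in the same signed "deny" receipt, and editing a receipt's `decision` field after the fact makes `verify` return `False`.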

To show this isn't just philosophy, they built a working version: a small, security-critical core written in a language chosen for its strict safety guarantees. Then they did something most AI safety work doesn't: they mathematically proved that the core's central rule holds, using formal verification tools that exhaustively check every possible case rather than just testing a sample. Finally, they turned an adversary loose on it, an AI actively trying to escape and rewrite itself, across thousands of attempts. The reported outcome was that every single attempt to bypass the protected core was refused, with no successful escapes.
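The difference between checking every case and testing a sample can be shown with a toy Python sketch. It is only an analogy: the article does not name the paper's verification tools, real formal verification reasons about far larger state spaces than brute-force enumeration can cover, and the rule `gate_allows` and its three boolean inputs are hypothetical.

```python
from itertools import product


def gate_allows(allowlisted: bool, check_ok: bool, targets_core: bool) -> bool:
    """Toy decision rule: allow only a vetted action that leaves the core alone."""
    return allowlisted and check_ok and not targets_core


def invariant_holds(allowlisted: bool, check_ok: bool, targets_core: bool) -> bool:
    """Invariant: a request that targets the protected core is never allowed."""
    return not (targets_core and gate_allows(allowlisted, check_ok, targets_core))


# Enumerate all 2**3 = 8 input combinations instead of sampling a few of them.
all_cases = list(product([False, True], repeat=3))
counterexamples = [case for case in all_cases if not invariant_holds(*case)]
```

An empty `counterexamples` list here means the invariant holds for every input in this tiny model, which is a stronger statement than "no sampled test failed."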

Why it matters: this is the structurally serious answer to the prompt-injection and self-modification fears hanging over agentic AI, and it is a direct critique of the more common approach of baking safety into the model itself. The very same week, Google shipped computer-use agents whose main injection defenses live inside the model, exactly the "escapable" pattern this paper warns about. The argument here is that for agents touching real systems and real money, the safety has to live somewhere the agent can't reach.

The honest caveat is scope. A guard on the door only governs what goes through the door. Anything the agent can do through an unguarded side channel, an unmediated tool, or a sloppy integration, or by manipulating the human in the loop, is still outside this protection. And "fail closed" buys safety at the price of availability: a system that denies when uncertain will sometimes deny things it shouldn't, which is its own kind of cost. This is a foundation for trustworthy agents, not a finished fortress. But it reframes the problem in a healthier way: stop trying to convince the AI to behave, and start building rooms it can't get out of.

Primary source (verified): arXiv 2606.26057