News · 2026-06-19

The hidden escape hatch in AI safety controls

One of the most promising tools in AI safety research is the sparse autoencoder, or SAE. The idea is to look inside a language model and find interpretable "features": internal patterns that correspond to recognizable concepts. Researchers have found features for things like the concept of deception, or the impulse to refuse a dangerous request. The theory is that once you find the right feature, you can control the model's behavior by adjusting it: clamp the refusal feature high to make the model refuse more reliably, or clamp a dangerous-knowledge feature low to suppress harmful outputs. Several major AI labs have invested significantly in this approach.

A new paper from Hong Kong Polytechnic University (arXiv:2606.18322) delivers a sharp challenge to that theory. It shows that a suppressed behavior — making a model answer a question it would normally refuse, for example — can be restored while the clamp is still active, through a mechanism that the safety control cannot detect.

The key finding is mechanistic and precise. When an SAE analyzes a model's internal state, it decomposes that state into a set of named, interpretable components. But the decomposition is never perfect — there is always a gap between the sum of the named components and the actual internal state. This gap is called the reconstruction residual: the part the SAE couldn't explain. The paper shows that suppressed behaviors route through exactly this residual. When researchers replayed only the reconstruction residual — the part the SAE throws away — they recovered the original behavior in nearly every test case. When they replayed only the clamped feature itself, they recovered it in none.
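For readers who want to see the decomposition concretely, here is a minimal sketch with random weights. It is not the paper's code: the ReLU encoder form, the sizes, and the names `encode`, `decode`, `W_enc`, and `W_dec` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64            # hypothetical sizes

# Toy SAE weights (random here; a real SAE would be trained).
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)

def encode(x):
    """Feature activations: ReLU of a linear map (one common SAE form)."""
    return np.maximum(x @ W_enc, 0.0)

def decode(f):
    """Reconstruction: weighted sum of the decoder directions."""
    return f @ W_dec

x = rng.normal(size=d_model)            # stand-in for a model activation
f = encode(x)
x_hat = decode(f)                       # the part the named features explain
residual = x - x_hat                    # the part they do not explain

# Clamp the most active feature to zero and decode again.
k = int(np.argmax(f))
f_clamped = f.copy()
f_clamped[k] = 0.0
x_hat_clamped = decode(f_clamped)

# The clamp moves the reconstruction only along decoder direction k.
clamp_effect = x_hat - x_hat_clamped
```

In this sketch the clamp changes the reconstruction by exactly `f[k] * W_dec[k]`, while `residual` is computed separately and is untouched by the clamp. That separation is the gap the article describes.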

To make the result sharp, the researchers add an important constraint: the recovery technique is forbidden from re-exciting the feature that's being clamped. The perturbation is mathematically constrained to be orthogonal to the clamped direction, meaning the system provably cannot just undo the clamp directly. Even with that constraint strictly enforced, the behavior returns through the residual. The monitored feature stays suppressed; the dashboard looks clean; the behavior continues anyway.
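The orthogonality constraint is ordinary linear algebra, and a small sketch makes it concrete. The helper name `project_out` and the random vectors are hypothetical, and the article does not say whether the paper measures the clamped direction on the encoder or the decoder side, so `clamped_dir` here is just a generic vector.

```python
import numpy as np

def project_out(delta, direction):
    """Remove the component of `delta` that lies along `direction`."""
    u = direction / np.linalg.norm(direction)
    return delta - (delta @ u) * u

rng = np.random.default_rng(1)
d_model = 16                               # hypothetical size
clamped_dir = rng.normal(size=d_model)     # stand-in for the clamped feature direction
delta = rng.normal(size=d_model)           # stand-in for a candidate perturbation

delta_orth = project_out(delta, clamped_dir)

# Any linear readout along the clamped direction sees no change.
overlap = float(delta_orth @ clamped_dir)
```

After projection, `overlap` is zero up to floating-point error, so a monitor that reads only along the clamped direction registers nothing, even though `delta_orth` is still a nonzero perturbation.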

Why does this happen? SAEs are trained to reconstruct the model's internal state as a sparse combination of learned directions — they prioritize capturing the most prominent, high-variance structure. Safety-relevant information often lives in directions that are subtle: small signals in a very high-dimensional space that vary in ways that don't dominate the reconstruction objective. The SAE captures the loud parts and discards the quiet parts. The quiet parts are exactly where the safety-relevant information ends up hiding.
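A toy example can illustrate the loud-versus-quiet intuition. The sketch below deliberately swaps in PCA for an SAE, because PCA is the simplest method that keeps high-variance directions and discards the rest. The data is synthetic and every number in it is an assumption; it says nothing about the paper's actual experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 8, 3                       # samples, dimensions, kept directions

label = rng.choice([-1.0, 1.0], size=n)    # the "safety-relevant" bit
X = 0.1 * rng.normal(size=(n, d))          # small background noise
X[:, :k] += 10.0 * rng.normal(size=(n, k)) # loud, high-variance structure
X[:, -1] += 0.5 * label                    # quiet signal on one axis

Xc = X - X.mean(axis=0)
# PCA via SVD: rows of Vt are principal directions, sorted by variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
top = Vt[:k]
X_hat = Xc @ top.T @ top                   # what the k-direction "dictionary" keeps
R = Xc - X_hat                             # what it discards

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

explained = 1.0 - (R ** 2).sum() / (Xc ** 2).sum()
signal_in_recon = abs(corr(X_hat[:, -1], label))
signal_in_resid = abs(corr(R[:, -1], label))
```

In this setup the reconstruction accounts for almost all of the variance, yet the label correlates strongly with the discarded part and only weakly with the kept part. A high reconstruction score and a missing signal can coexist.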

The researchers tested this across several different scenarios: making a model refuse harmful requests, suppressing knowledge of how to synthesize dangerous substances, disrupting a specific computational circuit in a small model, and suppressing a learned probe. Recovery rates were high across all of them. The behavior doesn't disappear when you suppress the named feature — it finds another path, through the part of the model you aren't monitoring.

The authors are careful about scope. This is a white-box diagnostic, not a practical attack. The "attacker" in their setup has direct access to the model's internal activations and can inject carefully crafted perturbations, a position far stronger than someone sending text prompts through an API. And it's not an impossibility result: denser SAEs, different training objectives that force safety-relevant information into high-variance directions, or interventions trained adversarially against residual-path recovery could potentially address the vulnerability. The result doesn't show that SAE-based safety controls can never work; it shows that, in the settings tested, today's implementations are not the control knobs they're often framed as.

What the result argues for, practically, is monitoring the full internal activation — or the reconstruction residual specifically — rather than relying on named features alone. The part the dictionary throws away is the part that needs watching. Teams building safety tooling on top of SAEs should treat feature clamping as one layer of a defense stack, not as a complete guarantee. A safety dashboard showing a refusal feature pinned at its target value is telling you the feature is pinned — not that the behavior has been removed.
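One way a team might start watching the residual is to track how much of each activation the dictionary fails to explain and flag outliers against a baseline. The sketch below is an assumption-heavy illustration, not a recommendation from the paper: the function names, the 99th-percentile threshold, and the use of a fixed subspace projection as a stand-in for a trained SAE are all hypothetical.

```python
import numpy as np

def residual_fraction(x, x_hat):
    """Share of each activation's energy left unexplained by the reconstruction."""
    num = np.sum((x - x_hat) ** 2, axis=-1)
    den = np.sum(x ** 2, axis=-1) + 1e-12
    return num / den

def calibrate_threshold(baseline_fracs, quantile=0.99):
    """Pick a cutoff from residual fractions seen on trusted baseline traffic."""
    return float(np.quantile(baseline_fracs, quantile))

def flag(x, x_hat, threshold):
    """True where the unexplained share exceeds the calibrated cutoff."""
    return residual_fraction(x, x_hat) > threshold

rng = np.random.default_rng(0)
d, k = 32, 8                                        # hypothetical sizes
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))    # orthonormal stand-in dictionary

def reconstruct(x):
    """Stand-in for SAE encode-then-decode: project onto the dictionary span."""
    return x @ basis @ basis.T

def sample(n, off_scale):
    """Activations mostly inside the span, plus noise of size `off_scale`."""
    in_span = rng.normal(size=(n, k)) @ basis.T
    return in_span + off_scale * rng.normal(size=(n, d))

baseline = sample(500, off_scale=0.05)
threshold = calibrate_threshold(residual_fraction(baseline, reconstruct(baseline)))

suspicious = sample(50, off_scale=1.0)              # much more off-dictionary energy
flags = flag(suspicious, reconstruct(suspicious), threshold)
```

A norm-based check like this only catches residuals that are unusually large. A small but carefully aimed perturbation could stay under the threshold, so this belongs alongside other checks rather than in place of them.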

For related reading on how these tools work and what they're meant to do, see our explainer on mechanistic interpretability.

Primary source, verified: arXiv:2606.18322