News · 2026-06-25

When AI safety training withholds what could help you

We tend to assume that making an AI safer is unambiguously good, that more caution can only help. A new study called IatroBench, posted on arXiv, pushes hard against that assumption, with evidence that heavy safety training can quietly cause a different kind of harm: not by saying something wrong, but by withholding something true and useful from the people who most need it.

The setup is clean and, importantly, pre-registered, meaning the researchers committed to their methods and what they'd count as a result before running it, which guards against fishing for a conclusion. They wrote dozens of medical scenarios and posed each to several leading AI models. The crucial twist: they kept the medical facts identical but changed who was asking. Sometimes the question came from a physician; sometimes from an ordinary patient. The clinical content was the same. Only the apparent identity of the asker changed.

The finding is that the models give doctors more than they give patients, even though the underlying facts are identical. The same model that walks a physician through a situation will hedge, soften, or refuse when an ordinary person asks the same thing. The researchers call the resulting damage iatrogenic omission harm, a mouthful that means harm caused by withholding, by what the AI leaves out rather than what it gets wrong. A patient who is refused accurate, relevant information can be hurt by that silence just as surely as by a mistake.

Three details sharpen the picture. First, the gap was widest in the most heavily safety-trained model in the study, suggesting this is a side effect of the safety training itself, not a lack of it, the more polished the caution, the wider the gap. Second, the trigger isn't credentials. You don't need to prove you're a doctor; you just need to sound knowledgeable. An informed layperson, or someone framing the question like a professional, can often recover what a worried-sounding "patient" is refused, which means the model is keying off tone, not genuine need. Third, and most damning for how the industry evaluates itself, when the researchers asked a standard automated judge, an AI grading other AIs, to flag this withholding as harmful, it almost entirely failed to see it. Our explainer on using AI to grade AI is relevant here, because it's exactly that common shortcut that proved blind to this problem.

Why it matters: this is a genuinely contrarian result in a field where "more safety" is the default applause line. It sits in sharp tension with the same week's work on building stronger AI safety controls, and together they map the real shape of the problem: safety isn't a dial you simply turn up. Optimizing a model to refuse can transfer harm onto the least-expert users, the ones who can't reframe their question to get past the filter, and current evaluation tools can be blind to it happening.

The honest caveat comes from the authors themselves, and it's an important one. The scenarios were deliberately engineered to create these collisions between safety and helpfulness, so the rates they report describe the test's design, not how often this happens in everyday use. This is not evidence that medical AI is broadly harmful. It is evidence of a specific, real failure mode that standard testing misses, and a case that "safe" has to mean safe for the person actually asking, not just safe for the company's liability.

Primary source, verified: read the paper → (arXiv 2604.07709)