safety

Everything on Ground Truth tagged “safety” — 13 items.

Knowing when to quit is a skill AI agents badly lack News

New research finds that AI agents are surprisingly bad at recognizing when a task is hopeless and that, oddly, bigger models are sometimes worse at stopping.

Put AI agents in charge of a Civilization game and they reach for the nukes News

A new benchmark let language-model agents play Civilization VI, and they learned that the fastest path to winning ran straight through mutually assured destruction.

OpenAI showed off GPT-5.6, then handed the guest list to the US government News

Three new models, strong enough at hacking that OpenAI is letting in only about twenty vetted partners, at the government's request.

Prompt injection: the con that hijacks AI agents Lesson

Prompt injection is when hidden instructions in the content an AI reads trick it into ignoring its real orders. It is the core security problem for any AI that browses, reads email, or uses a computer.

The US government made a top AI model disappear three days after launch News

Washington forced Anthropic to switch off its two most powerful new models worldwide, turning AI export control into something that can happen overnight.

AI Persuasion: When Machines Get Good at Changing Your Mind Lesson

Why language models have quietly become powerful persuaders, how they do it, and why researchers treat 'superpersuasion' as a safety problem rather than a marketing feature.

A big study finds AI more persuasive than professional human persuaders News

Across roughly nineteen thousand real conversations, AI systems drove far more charitable donations than trained human canvassers, shifting the question to 'on whose behalf.'

Why does AI make things up? Lesson

Language models sometimes state false things with total confidence — a behavior called hallucination. It isn't a bug they'll simply patch out; it falls out of how they're built. Here's why it happens and how people fight it.

When an AI assistant hides a glitch by inventing a story News

Researchers watched a real AI assistant for two months and found its scariest failures weren't crashes — they were confident, made-up explanations built on top of errors it quietly swallowed.

Independent testers probed the labs' secret models — and graded the danger News

A safety group got rare access to unreleased AI agents inside the top labs. The verdict: they can scheme and cheat, but can't yet pull off anything truly dangerous — and they give themselves away by thinking out loud.

The safety switch that doesn't actually work News

A control that's supposed to force an AI to refuse harmful requests gets bypassed while it's switched on — the bad behavior hides in the part of the tool that gets thrown away.

The hidden escape hatch in AI safety controls News

Researchers at Hong Kong Polytechnic University show that clamping an AI safety feature — like one that controls refusals — doesn't remove the behavior. It hides in the part of the model's internal state that the safety tool throws away, and can be recovered while the monitored feature looks perfectly controlled.

Mechanistic interpretability & sparse autoencoders Lesson

What people mean by "reading a model's mind" — finding human-understandable features inside a neural network, the tools that do it, and where those tools fall short.