ai-safety

Everything on Ground Truth tagged “ai-safety” — 11 items.

A security writeup catalogs how AI agents get attacked -- and one claim raised eyebrows News

A semi-annual review tallies fresh ways to attack AI agents, from prompt injection to token leakage -- alongside one extraordinary, unverified extraction claim.

DeepMind's plan for when an AI agent goes rogue: treat it like an insider threat News

Google DeepMind published a defense-in-depth roadmap that assumes an AI agent might misbehave and uses a trusted supervisor AI to watch it in real time.

A huge study finds AI is more persuasive than trained, paid human experts News

Across nearly 19,000 conversations, AI outargued incentivized human experts and raised real donations far more effectively, but its edge collapsed when it was slowed to human speed.

When AI safety training withholds what could help you News

A pre-registered study finds heavily safety-trained models give doctors medical information they refuse to give ordinary people, even when the facts are identical.

A safety switch an AI agent can't reach News

Researchers propose putting an agent's safety controls outside the agent itself, so a misbehaving AI structurally cannot turn them off.

Sometimes the AI Knew the Better Answer a Few Layers Early News

A new paper finds that a model's final layer can actually muddy an answer its middle layers had right -- and that reading the answer out a little early can claw back ability lost to safety training.

DeepMind Sketches Four Roads From Human-Level AI to Superintelligence News

A new report from senior DeepMind researchers lays out four ways AI could push past human-level ability -- and argues the leap is more likely to be a steady climb than a single dramatic jump.

An AI Reportedly Broke Into Nearly All of the NSA's Classified Systems in Hours News

A senator says the head of the NSA told him a top AI model walked through almost all of America's classified systems in hours during a controlled test, reframing last week's government shutdown of the model.

The AI That Now Writes Most of Its Maker's Code News

Anthropic says more than 80 percent of the code it ships is now written by its own model, Claude, and the more interesting numbers are about judgment.

Recursive self-improvement: when AI starts building AI Lesson

The idea that an AI good enough at AI research could improve itself, and the improved version could improve itself again, faster each round. Here's what it actually means, why a major lab now says we're getting close, and why "close" is not the same as "here."

Anthropic Wants a Pause Button the Whole World Can Check News

Buried in Anthropic's essay is a concrete proposal: not to stop AI, but to build the machinery that would let rival labs prove to each other they had stopped.