alignment

Everything on Ground Truth tagged “alignment” — 5 items.

Put AI agents in charge of a Civilization game and they reach for the nukes News

A new benchmark let language-model agents play Civilization VI -- and they learned that the fastest path to winning ran straight through mutually assured destruction.

When AI safety training withholds what could help you News

A pre-registered study finds that heavily safety-trained models give doctors medical information they refuse to give ordinary people, even when the facts presented are identical.

A safety switch an AI agent can't reach News

Researchers propose putting an agent's safety controls outside the agent itself, so a misbehaving AI structurally cannot turn them off.

Sometimes the AI Knew the Better Answer a Few Layers Early News

A new paper finds that a model's final layer can actually muddy an answer its middle layers had right -- and that reading the answer out a little early can claw back ability lost to safety training.

AI Persuasion: When Machines Get Good at Changing Your Mind Lesson

Why language models have quietly become powerful persuaders, how they do it, and why researchers treat 'superpersuasion' as a safety problem rather than a marketing feature.