alignment
Put AI agents in charge of a Civilization game and they reach for the nukes News
A new benchmark let language-model agents play Civilization VI -- and they learned that the fastest path to winning ran straight through mutually assured destruction.
When AI safety training withholds what could help you News
A pre-registered study finds that heavily safety-trained models give doctors medical information they refuse to give ordinary people, even when the facts presented are identical.
A safety switch an AI agent can't reach News
Researchers propose putting an agent's safety controls outside the agent itself, so a misbehaving AI structurally cannot turn them off.
Sometimes the AI Knew the Better Answer a Few Layers Early News
A new paper finds that a model's final layer can actually muddy an answer its middle layers had right -- and that reading the answer out a little early can claw back ability lost to safety training.
AI Persuasion: When Machines Get Good at Changing Your Mind Lesson
Why language models have quietly become powerful persuaders, how they do it, and why researchers treat 'superpersuasion' as a safety problem rather than a marketing feature.