interpretability

Everything on Ground Truth tagged “interpretability” — 5 items.

Why making an AI think out loud helps it remember facts, even nonsense thinking News

Google Research found that reasoning traces help a model recall facts partly by buying it extra computation, so even repeating 'let me think' helps, though hallucinated reasoning steps backfire.

Sometimes the AI Knew the Better Answer a Few Layers Early News

A new paper finds that a model's final layer can actually muddy an answer its middle layers had right, and that reading the answer out a little early can claw back ability lost to safety training.

The safety switch that doesn't actually work News

A control that's supposed to force an AI to refuse harmful requests gets bypassed while it's switched on — the bad behavior hides in the part of the tool that gets thrown away.

The hidden escape hatch in AI safety controls News

Researchers at Hong Kong Polytechnic University show that clamping an AI safety feature — like one that controls refusals — doesn't remove the behavior. It hides in the part of the model's internal state that the safety tool throws away, and can be recovered while the monitored feature looks perfectly controlled.

Mechanistic interpretability & sparse autoencoders Lesson

What people mean by "reading a model's mind" — finding human-understandable features inside a neural network, the tools that do it, and where those tools fall short.