sparse-autoencoders

Everything on Ground Truth tagged “sparse-autoencoders” — 2 items.

The safety switch that doesn't actually work News

A control that's supposed to force an AI to refuse harmful requests gets bypassed while it's switched on — the bad behavior hides in the part of the tool that gets thrown away.

Mechanistic interpretability & sparse autoencoders Lesson

What people mean by "reading a model's mind" — finding human-understandable features inside a neural network, the tools that do it, and where those tools fall short.