sparse-autoencoders
Everything on Ground Truth tagged “sparse-autoencoders” — 2 items.
The safety switch that doesn't actually work News
A control that's supposed to force an AI to refuse harmful requests gets bypassed while it's switched on — the bad behavior hides in the part of the tool that gets thrown away.
Mechanistic interpretability & sparse autoencoders Lesson
What people mean by "reading a model's mind" — finding human-understandable features inside a neural network, the tools that do it, and where those tools fall short.