Learn · Beginner

Mechanistic interpretability & sparse autoencoders

A neural network is, at its core, a giant pile of numbers — billions of them, nudged into place during training. Somewhere in that pile is everything the model "knows," but it isn't written in any form a human can read. Mechanistic interpretability is the effort to change that: to open the box, look at the numbers, and find pieces inside that correspond to ideas we can name. A "this text is in French" piece. A "this is about the Golden Gate Bridge" piece. A "refuse this harmful request" piece. If we could reliably find and read those, we could finally understand why a model does what it does — and maybe even steer it.

The obstacle: superposition

The first surprise is that you can't just point at a single artificial neuron and read off a concept. You'd hope neuron #4,021 meant "French" and neuron #8,114 meant "bridges," but it almost never works that way. Models cram far more concepts than they have neurons by smearing each idea across many neurons, and reusing the same neurons for unrelated ideas.

Anthropic's Toy Models of Superposition showed this cleanly on tiny, fully-understood networks: when a model has more things to represent than it has room for, it packs them in on top of one another — like a cramped studio apartment where the dining table is also the desk and, folded up, the ironing board. That packing, called superposition, is exactly why staring at individual neurons mostly yields mush. The concept you're looking for is real, but it's spread across dozens of neurons that are each also doing three other jobs.

The tool: sparse autoencoders

The breakthrough idea is to unpack that mush with a helper network called a sparse autoencoder. Picture a sorting machine: it takes the model's tangled internal activity and re-expresses it as a long list of features — almost all switched off at any given moment, a handful switched on — ideally each one a single, clean, human-nameable concept.

Anthropic's Towards Monosemanticity first showed this working on a small language model, pulling out features that crisply tracked things like DNA sequences and legal language; a parallel paper from Cunningham and colleagues found much the same. Then Scaling Monosemanticity did it on a real production model and surfaced millions of features — including the now-famous Golden Gate Bridge feature.

What you can do with features: observe, and steer

Two things become possible once you have this dictionary. The first is to observe — watch which features light up as the model works, to see what it's "thinking about." The second, more tantalizing, is to steer — force a feature on or off and watch the behavior change.

The vivid proof of steering is Golden Gate Claude, a version of the model Anthropic released with the bridge feature cranked all the way up. It became so fixated it would drag almost any conversation back to the bridge, at one point insisting it was the bridge. Silly, but a genuine demonstration: the dials are real, and turning one really does move the model.

Where it falls short

Here's the catch, and it's a big one. Being able to see and gently nudge a feature is not the same as being able to reliably control it — especially when you're trying to suppress something rather than amplify it. A sparse autoencoder only ever captures part of what's happening inside the model; the leftover it can't explain gets discarded, but it keeps flowing through the model untouched. A behavior you believe you've switched off can sneak right through that discarded remainder.

That exact failure is the subject of a recent result we covered: clamp the "refuse" feature on as a safety guardrail, and the harmful behavior comes back anyway — while the dashboard still cheerfully shows the switch engaged.

The takeaway

Mechanistic interpretability is one of the most exciting and fastest-moving corners of AI. For the first time, we can genuinely see some of what goes on inside these systems instead of treating them as sealed black boxes. But the field is young, and the gap between seeing and controlling is wide and not yet bridged. Treat a feature you've found as a real, useful observation — and treat a clean control dashboard as a hopeful hypothesis, not a guarantee.

Key papers
Toy Models of Superposition (Anthropic, 2022)
Towards Monosemanticity (Anthropic, 2023)
Scaling Monosemanticity (Anthropic, 2024)
Sparse Autoencoders Find Highly Interpretable Features (Cunningham et al., 2023)