mechanistic-interpretability

Everything on Ground Truth tagged “mechanistic-interpretability” — 2 items.

Polishing AI by looking inside its 'mind' instead of just thumbs-up, thumbs-down News

Reward training usually treats the model as a black box: thumbs up, thumbs down, hope for the best. A new method peers inside to see why an answer was preferred, and shapes the lesson on purpose.

The hidden escape hatch in AI safety controls News

Researchers at Hong Kong Polytechnic University show that clamping an AI safety feature, such as one that controls refusals, doesn't remove the behavior. The behavior hides in the part of the model's internal state that the safety tool throws away, where it can be recovered while the monitored feature looks perfectly controlled.