News · 2026-06-19

Crediting an AI for the right steps — without a second model to judge them

Here's a puzzle at the heart of teaching AI to reason. You reward the model when it reaches the right final answer — but a hard problem takes dozens of steps to solve, and only some of them were actually good. Maybe step three was a brilliant insight, steps four through nine were sloppy luck, and step ten happened to land on the right number. If you praise the whole chain equally, you reinforce the sloppiness right along with the insight, teaching the model that its lucky guesses were as good as its real reasoning. Figuring out which steps truly earned the reward is called credit assignment, and it's one of the genuinely hard parts of this kind of training. (If the whole reward-training idea is new to you, our explainer on reward-based fine-tuning sets the scene.)

The standard fix is to train a second AI — a "critic" — whose entire job is to look at a half-finished solution and estimate how well it's going, step by step. That works, but it's costly and finicky: you're now building, training, and maintaining a whole extra model just to dole out the credit. And if that critic is even slightly off, it quietly poisons everything the main model learns, praising bad steps and dinging good ones in ways that are hard to notice until the training has gone subtly wrong. A miscalibrated critic is one of the classic ways this kind of training fails.

A new paper makes a more elegant argument: you don't need the second model at all, because the credit signal is already sitting right under your nose. The insight is mathematical, but the gist is graspable. During this training, the system already computes a particular quantity for each word the model produces — essentially a measure of how much that word surprised the model relative to what it expected. The paper shows that, read with the right lens, that already-available number is a fine-grained, per-step credit signal. In other words, the information you were paying a whole extra model to estimate was hiding in plain sight in the numbers you were computing anyway. You just had to recognize it for what it was.

To put it in human terms: imagine grading a student's long proof. The expensive way is to hire a second teacher who reads over the student's shoulder and rates each line as it's written. This paper's way is to notice that the student's own moments of hesitation and surprise — where they paused, changed direction, committed to a leap — already tell you which lines were the load-bearing ones. The signal was in the student's working all along; you didn't need to hire anyone.

The appeal is that you get the good thing — careful, step-by-step credit instead of one blunt reward smeared across the whole chain — at essentially no extra cost, and with one fewer moving part to break. Removing the critic doesn't just save compute; it removes a notorious source of subtle bugs.

This lands as part of a clear theme running through this week's research: squeezing more out of the reward-training phase by being cleverer, not heavier. One result protects the rare words that keep a model from getting repetitive and overconfident; another speeds up training by cloning the model on the fly; this one deletes an entire helper model by noticing its job was redundant. None are flashy on their own, but together they sketch a field maturing — finding efficiency and insight inside the machinery it already has, rather than always bolting on more. After a couple of years of "make it bigger," there's something refreshing about a wave of "look closer at what you've already got."

The caveats are honest and modest: it's new work, and the gains tend toward "as good as the critic-based approach, but simpler and cheaper" rather than a dramatic leap in raw capability. There's also added subtlety in the math that has to be handled carefully to make the trick valid — read the wrong quantity the wrong way and the credit signal is garbage. But "the thing you were training a second model to compute was already in your hands" is exactly the kind of clarifying result that makes a complicated process a little less complicated — and that tends to get adopted precisely because it removes work rather than adding it.

Primary source, verified: read the paper → (arXiv 2606.20008)