News · 2026-06-20

Teaching AI with rewards — minus the expensive second model that grades it

After a language model is first trained to predict text, it goes through a polishing phase where it's rewarded for good answers and nudged away from bad ones — the step that turns a raw text-predictor into a focused, helpful assistant. A lot of the recent progress in reasoning models comes from doing this reward phase well. But there's a hidden cost most people never see: many of these methods quietly run a second model alongside the one you actually care about, whose only job is to estimate how good the current situation is. A new method proposes getting rid of it.

First, why the second model exists. When you reward a model for a long answer — a multi-step math solution, say — you face a credit-assignment problem: which of the many steps deserve the credit when the final answer is right, and which deserve blame when it's wrong? The traditional fix — borrowed from classical reinforcement learning — is to train a separate critic (sometimes called a value model) that watches along and estimates, at each point, how well things are going. That critic is what lets the system hand out fine-grained credit. The catch is that this critic is itself a large model — it costs memory, compute, and engineering effort to train and keep in sync. You're effectively running two models to improve one.
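To make the critic's role concrete, here is a minimal sketch of one common recipe for turning a critic's estimates into per-step credit, known as generalized advantage estimation. The function name, the three-step example, and the critic outputs are all made up for illustration; nothing here is taken from the VIMPO paper.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Per-step credit computed from a critic's value estimates.

    rewards: reward received after each step (length T)
    values:  critic's estimate at each step, plus one final
             bootstrap value for the state after the last step (length T + 1)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better this step went than the critic expected.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Blend this step's surprise with the discounted surprises that follow.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical 3-step solution where only the final answer is rewarded.
rewards = [0.0, 0.0, 1.0]
# Made-up critic outputs; the last entry is the value after the final step.
values = [0.5, 0.25, 0.75, 0.0]

step_credit = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
print(step_credit)
```

Every number in `values` has to come from a second network that is trained and stored alongside the assistant, which is the overhead the article describes.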

The new paper introduces VIMPO, a method that skips the separate critic entirely. Its trick is mathematical: the policy you're already training (the assistant itself) implicitly contains the information a critic would provide. By exploiting the mathematical conditions that an optimally trained model must satisfy, VIMPO derives a value estimate directly from the model's own behavior, without ever building a second network. The judgment was hiding inside the model all along; you just have to read it out.
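For readers who want a feel for what such an optimality condition can look like, here is one well-known example from KL-regularized reinforcement learning. It is offered as an illustration only: the article does not say which conditions VIMPO uses, so the setting (a KL penalty of strength \beta toward a reference model \pi_ref) and all symbols below are assumptions, and the paper's actual derivation may differ.

```latex
% Assumed setting: maximize reward minus a KL penalty of strength \beta
% that keeps the policy close to a reference model \pi_{\mathrm{ref}}.
% Q^{*} and V^{*} are the optimal action-value and state-value functions.
\pi^{*}(a \mid s) = \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\left( \frac{Q^{*}(s,a) - V^{*}(s)}{\beta} \right)
\quad\Longrightarrow\quad
Q^{*}(s,a) - V^{*}(s) = \beta \log \frac{\pi^{*}(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
```

Under those assumptions, the right-hand side uses only probabilities the policy and the reference model already produce, so the quantity a critic would normally estimate can be computed without a second network.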

An analogy: imagine training for a sport with a separate coach standing on the sideline rating each move. VIMPO is like discovering that, if you set up your practice correctly, your own sense of how the play is going already encodes everything the coach would have told you — so you can let the coach go home. You keep the feedback, you drop the second salary.

Beyond saving the cost of the extra model, the authors make a second claim that matters in practice: their approach is steadier when the rewards are noisy. In the real world, the signal telling a model whether it did well is rarely clean: graders disagree, automated checks are imperfect, and some "correct" answers got lucky. The dominant critic-free method in wide use today, GRPO (the one behind several well-known reasoning models, including DeepSeek-R1), can be thrown off by that noise. VIMPO is designed to stay more stable when the feedback is unreliable, which is most of the time.
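A toy sketch shows how noise can leak into a critic-free method. It assumes the method alluded to above is GRPO (group relative policy optimization, the approach used for DeepSeek-R1), which scores each sampled answer against the average of its own group. The function name and the reward lists are hypothetical, and this is not VIMPO's algorithm.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style credit: each answer is scored relative to its group.

    rewards: one scalar reward per sampled answer to the same prompt.
    eps:     small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt; only the first is actually correct.
clean = [1.0, 0.0, 0.0, 0.0]
# Same answers, but a faulty grader marks the second one correct too.
noisy = [1.0, 1.0, 0.0, 0.0]

clean_adv = group_relative_advantages(clean)
noisy_adv = group_relative_advantages(noisy)
print(clean_adv)
print(noisy_adv)
```

In this sketch, one mislabeled reward does two things at once: the wrong answer is pushed up, and the correct answer's credit shrinks because the group average and spread have both shifted. That is one simple way a group-based baseline can be sensitive to unreliable feedback.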

Why it matters: the reward-polishing phase is where much of a model's usefulness and reasoning ability is forged, and it's run constantly across the industry. Shaving off an entire auxiliary model makes that phase cheaper and simpler — fewer moving parts, less memory, less that can go wrong. As reasoning models proliferate and labs run this phase over and over, methods that deliver the same quality with half the machinery compound into real savings. It also fits a clear pattern in this week's research: a steady push toward doing the expensive parts of training with less apparatus.

The honest caveat is about scale. Reading the value signal out of the model implicitly, rather than training a dedicated critic to provide it, leans on a mathematical relationship that can become delicate as models grow. A purpose-built critic, for all its expense, is a stable and well-understood source of feedback. Whether the implicit approach stays accurate and steady at the largest scales — or whether the estimation gets shaky when the stakes and sizes go up — is exactly what broader adoption will test. But as a cleaner, cheaper way to run one of AI's most important training steps, VIMPO is a notable entry in a fast-moving area.

Primary source, verified: arXiv 2606.20008