News · 2026-06-19

Faster AI training by quietly cloning the model

When a model is being polished with rewards — the phase where it practices, gets graded, and improves, which we cover in our explainer on reward-based fine-tuning — most of the time isn't actually spent learning. It's spent waiting. Before the model can be rewarded for a good answer, it has to write that answer out in full, word by word, thousands and thousands of times over. That generation step is slow, and it dominates the clock. A new paper goes straight at that bottleneck with a clever move: have the model make a cheap, shrunk-down clone of itself to do the fast writing.

The idea borrows from a technique already used to speed up chatbots, called speculative decoding. The intuition is simple: a small, fast model races ahead and drafts the next chunk of text, and the big, expensive model only has to check the draft rather than compose every word itself. Checking is much quicker than writing because the big model can score a whole drafted chunk in a single pass, whereas writing takes one pass per word, so you get the big model's quality at closer to the small model's speed. The wrinkle in the training setting is that the model you're trying to accelerate is mid-training and constantly changing, so any fixed little helper quickly falls out of step, and its guesses stop matching what the big model would actually say.
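For readers who want to see the draft-and-check loop in miniature, here is a toy sketch. Everything in it is an illustrative assumption rather than the paper's code: the two "models" are simple arithmetic rules over digit tokens, the names `target_model`, `draft_model`, and `speculative_decode` are made up, and only the simplest greedy form of verification is shown.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its next token.
Model = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    # Toy stand-in for the big model: sum of the last two tokens, mod 10.
    return (prefix[-1] + prefix[-2]) % 10

def draft_model(prefix: List[int]) -> int:
    # Toy stand-in for the small model: same rule, but it guesses 0
    # whenever the last token is 7, so it is sometimes wrong.
    if prefix[-1] == 7:
        return 0
    return (prefix[-1] + prefix[-2]) % 10

def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: the model writes every token itself, one call per token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0  # verification rounds; a real system would score each
                # drafted chunk in one batched pass of the big model
    while len(out) < goal:
        # 1. The drafter races ahead and proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal position by position.
        rounds += 1
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out, rounds

prompt = [1, 1]
baseline = greedy_decode(target_model, prompt, 20)
spec_out, rounds = speculative_decode(target_model, draft_model, prompt, 20)
print(spec_out == baseline, rounds)
```

The point of the sketch is the guarantee: because every drafted token is checked against what the big model would have written, the output matches the baseline exactly, and a wrong guess only costs time.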

This paper's fix is the neat part. Instead of training and babysitting a separate helper, it just makes a compressed copy of the current model at every step — a stripped-down, lower-precision snapshot — and uses that as the fast drafter. Because the clone is regenerated constantly from the live model, it never drifts: it's always a faithful, cheap echo of exactly where training is right now. The researchers add one more bit of common sense: early in each batch, when the hardware is already running flat-out, racing ahead with speculation buys nothing, so they simply switch it off and turn it on only when there's spare capacity to exploit.
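The on/off switch can be pictured with a small sketch. The article does not say how the paper decides when the hardware has spare capacity, so the rule below is purely an assumption for illustration: speculation turns on once the number of answers still being written drops below a made-up `saturation_threshold`. The function name and the example lengths are hypothetical.

```python
from typing import List, Tuple

def speculation_schedule(rollout_lengths: List[int],
                         saturation_threshold: int) -> List[Tuple[int, bool]]:
    """For each decoding step, return (answers still in progress, speculate?).

    Assumed rule, not the paper's: speculate only when fewer than
    `saturation_threshold` answers are still being generated.
    """
    schedule = []
    for step in range(max(rollout_lengths)):
        # An answer of length n is still in progress at steps 0..n-1.
        active = sum(1 for n in rollout_lengths if n > step)
        schedule.append((active, active < saturation_threshold))
    return schedule

# Four answers in one batch; most finish early, one runs long.
schedule = speculation_schedule([2, 2, 3, 6], saturation_threshold=3)
for step, (active, speculate) in enumerate(schedule):
    print(step, active, "speculate" if speculate else "plain decoding")
```

In this toy batch the first two steps keep all four answers in flight, so speculation stays off; once only the stragglers remain, it switches on for the long tail.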

To picture it, imagine a senior editor who has to produce a mountain of draft pages. Rather than write every page themselves, they keep a quick-sketch junior who mimics their style closely enough to bang out rough drafts; the editor just skims and fixes. The trick that makes it work is that the "junior" is re-cloned from the editor every single morning, so it never picks up stale habits — it always writes in today's voice, not last week's. An assistant who's perpetually a fresh photocopy of you is one whose guesses you can actually trust to skim quickly.

It's worth being clear about what "compressed copy" means, because it's the cheap part of the trick. The clone is the same model stored in a coarser, lower-precision form: the numbers that make it up are rounded to fewer bits, so they take far less memory and run faster. You lose a little fidelity in the copy, but that's fine: the copy only has to guess, and the full-quality model still checks every guess. So the rounding never touches the final result; it only makes the drafting cheaper. It's a small, well-contained piece of engineering rather than a sweeping change to how training works.
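To make "lower-precision" concrete, here is a minimal sketch of one common way to round a model's numbers: symmetric 8-bit quantization. The article does not say which format the paper uses, so the 8-bit choice, the function names, and the sample weights are all assumptions for illustration.

```python
from typing import List, Tuple

def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    # Map floats onto integers in [-127, 127] using one shared scale.
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized: List[int], scale: float) -> List[float]:
    # Recover approximate floats; the rounding error stays in the copy.
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]          # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(quantized, scale, max(errors))
```

Each stored value now fits in a single byte instead of the two or four a typical float takes, and the error on any one number is at most half the scale. That small error is what the full-precision model's check absorbs.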

What's refreshing is the honesty of the results. The speedups are real but modest — meaningfully faster generation, a smaller but worthwhile cut to total training time — and, crucially, lossless: the finished model is no worse for it, because the big model still checks everything that matters. That stands out in a field where efficiency claims are often wildly inflated. Here the authors aren't promising to halve your training bill; they're promising to shave a real, dependable slice off the slowest step with essentially no downside.

This is one of several results this week aimed at the same target from different angles: doing the reward phase smarter. Another shows how to give a model fine-grained credit for its good steps without a second judge model; another protects the rare words that keep a model from getting repetitive and overconfident. The common thread is a field finding savings and insight inside the machinery it already has.

Why it matters: training these models is staggeringly expensive, and the reward phase is becoming one of the most important — and most compute-hungry — parts of building a strong reasoning model. Quiet, no-strings savings on the slowest step are exactly the kind of thing that compounds across an entire industry, even when no single number is dramatic.

The caveats are appropriately small: it's new work, the gains are larger on some model families than others, and "modest but lossless" is a feature rather than a headline. But that's the point: it's a sober, buildable optimization, not a miracle, and the self-cloning idea is clever enough that it'll likely turn up in other people's training pipelines before long.

Primary source, verified: arXiv 2606.18967