News · 2026-07-01

Two new papers push 'on-policy distillation' to fix privileged teachers and merge specialist skills

Two papers released together advance on-policy distillation, a training technique that is quietly becoming central to how small but strong models get built. DOPD (Dual On-policy Distillation) fixes a subtle failure in which an overly privileged teacher teaches skills the student cannot actually reproduce. MOPD (Multi-Teacher On-Policy Distillation) shows how to fold several independently trained specialist models into one generalist without the usual cross-skill interference. According to their authors, both have already been used in real frontier-model training, which is what lifts them above the usual research-paper churn.

Key facts

Both papers refine on-policy distillation: instead of copying a teacher's separately generated examples, the student trains on its own outputs while a teacher grades each step.
DOPD (arXiv:2606.30626) dynamically routes per-token supervision between teacher and student to defeat the 'privilege illusion,' improving stability and out-of-distribution performance in both language and vision-language models.
MOPD (arXiv:2606.30406) merges multiple specialist RL teachers into one student and was deployed in the post-training of MiMo-V2-Flash, an industrial-scale model.
Both were submitted to arXiv on June 29, 2026.

First, the shared idea. Classic distillation trains a small student to imitate a large teacher's outputs. But if the student only ever studies the teacher's answers, it never learns to recover from its own mistakes -- at inference time it drifts into territory the teacher never demonstrated. On-policy distillation flips this: the student generates its own attempts, and the teacher grades them token by token. The student learns from the exact situations it actually gets into. It is the difference between studying a grandmaster's recorded games and playing your own games with the grandmaster looking over your shoulder correcting each move.
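To make the loop above concrete, here is a minimal sketch in plain Python. It assumes the teacher's "grade" is a per-token reverse KL divergence between the student's and the teacher's next-token distributions, which is one common choice and not necessarily what either paper uses. The toy models, the three-token vocabulary, and every function name are hypothetical.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one token position (assumed grading signal)."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def on_policy_distill_loss(student_logits_fn, teacher_logits_fn, prompt, steps, rng):
    """Roll out the student, and let the teacher grade every step of that rollout."""
    context = list(prompt)
    per_token = []
    for _ in range(steps):
        s = softmax(student_logits_fn(context))
        t = softmax(teacher_logits_fn(context))
        per_token.append(reverse_kl(s, t))  # teacher grades the student's state
        # The next token comes from the STUDENT, so later steps are graded
        # in situations the student actually reaches.
        token = rng.choices(range(len(s)), weights=s)[0]
        context.append(token)
    return sum(per_token) / len(per_token), context

# Hypothetical toy models over a 3-token vocabulary.
def toy_student(context):
    return [0.0, 0.0, 0.0]  # untrained: uniform over tokens

def toy_teacher(context):
    preferred = (context[-1] + 1) % 3  # teacher prefers "previous token + 1"
    return [2.0 if i == preferred else 0.0 for i in range(3)]

rng = random.Random(0)
loss, rollout = on_policy_distill_loss(toy_student, toy_teacher, [0], 5, rng)
print(f"mean per-token reverse KL: {loss:.3f}, rollout length: {len(rollout)}")
```

In a real training run the per-token values would be backpropagated through the student; here they are only computed, to show where the student's own sampling enters the picture.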

DOPD tackles a specific way this can go wrong. Teachers are often given 'privileged information' during training -- verified reasoning hints, ground-truth scaffolding -- that the student will not have in the real world. Learn naively from such a teacher and you get a 'privilege illusion': the student mimics behavior that only made sense because of information it can never access, so it looks like it learned the skill without really having it. DOPD adds what the authors call advantage-aware dual distillation. For each token, it decides whether to trust the privileged teacher or the student's own signal, based on how much of an edge the teacher's guidance actually confers. Where the teacher's advantage comes from privilege the student cannot use, DOPD leans on the student instead. Across both language and vision-language settings, this yields more stable training and better generalization to out-of-distribution tasks than vanilla on-policy distillation.
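The per-token routing idea can be illustrated with a deliberately simple sketch. The gating rule below is an assumption made for illustration only: it treats a precomputed "teacher edge" score above a threshold as a sign that the teacher's advantage comes from privilege, and falls back to the student's own signal for that token. The paper's actual advantage-aware criterion is not described in this article, and the names `dual_distill_loss`, `teacher_edges`, and `max_usable_edge` are hypothetical.

```python
def dual_distill_loss(teacher_losses, self_losses, teacher_edges, max_usable_edge=1.0):
    """Pick, per token, which supervision signal contributes to the loss.

    teacher_losses: per-token loss against the privileged teacher
    self_losses:    per-token loss against the student's own signal
    teacher_edges:  per-token score of how far ahead the teacher is (assumed given)
    """
    if not (len(teacher_losses) == len(self_losses) == len(teacher_edges)):
        raise ValueError("all per-token lists must have the same length")
    if not teacher_losses:
        raise ValueError("need at least one token")

    routes = []
    total = 0.0
    for t_loss, s_loss, edge in zip(teacher_losses, self_losses, teacher_edges):
        # Hypothetical gate: an edge the student could plausibly close is
        # trusted; a larger one is assumed to come from privileged information.
        use_teacher = edge <= max_usable_edge
        routes.append("teacher" if use_teacher else "student")
        total += t_loss if use_teacher else s_loss
    return total / len(routes), routes

mixed_loss, routes = dual_distill_loss(
    teacher_losses=[1.0, 4.0, 0.5],
    self_losses=[0.2, 0.3, 0.1],
    teacher_edges=[0.5, 3.0, 1.0],
)
print(routes)  # ['teacher', 'student', 'teacher']
```

A hard threshold keeps the example readable; a soft weighting between the two signals would be an equally plausible reading of "dynamically routes."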

MOPD attacks a different, very practical headache: how do you build one model that is good at everything? Reinforcement learning works well when you train a model on one skill -- math, coding, instruction-following -- but combining several RL-trained skills in a single model usually causes them to interfere, and performance drops. MOPD's answer is to train each specialist teacher separately, then distill all of them into a single student on the student's own rollouts. Because the specialists are developed independently, teams can build them in parallel with no cross-domain coupling, and the student inherits nearly all of each teacher's ability. On a 30-billion-parameter base it beat mixing, cascading, off-policy fine-tuning, and parameter-merging baselines, and it was used in the post-training of MiMo-V2-Flash, an industrial-scale model.
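A rough sketch of the multi-teacher setup follows. It assumes each prompt carries a domain tag and is graded only by the matching specialist, and that domains are weighted equally in the final loss. Both are illustrative assumptions, not details taken from the paper, and the toy teachers and the names `multi_teacher_step` and `generate` are hypothetical.

```python
from collections import defaultdict

def multi_teacher_step(batch, teachers, generate):
    """Grade each student rollout with the specialist for its domain.

    batch:    list of (domain, prompt) pairs
    teachers: dict mapping domain -> function(prompt, rollout) -> per-token losses
    generate: the student's rollout function, prompt -> list of tokens
    """
    per_domain = defaultdict(list)
    for domain, prompt in batch:
        rollout = generate(prompt)                   # the student's own attempt
        token_losses = teachers[domain](prompt, rollout)
        per_domain[domain].append(sum(token_losses) / len(token_losses))
    # Assumed design choice: average within each domain first, so a domain
    # with more prompts does not dominate the combined loss.
    domain_means = {d: sum(v) / len(v) for d, v in per_domain.items()}
    total = sum(domain_means.values()) / len(domain_means)
    return total, domain_means

# Hypothetical stand-ins: the "student" echoes the prompt's tokens, and each
# "teacher" gives loss 0.0 to tokens it approves of and 1.0 otherwise.
def toy_generate(prompt):
    return prompt.split()

def math_teacher(prompt, rollout):
    ok = {"+", "-", "="}
    return [0.0 if tok in ok or any(c.isdigit() for c in tok) else 1.0
            for tok in rollout]

def code_teacher(prompt, rollout):
    keywords = {"def", "return"}
    return [0.0 if tok in keywords else 1.0 for tok in rollout]

teachers = {"math": math_teacher, "code": code_teacher}
batch = [("math", "2 + 2 = four"), ("code", "def f return x")]
total, by_domain = multi_teacher_step(batch, teachers, toy_generate)
print(by_domain, total)
```

The point of the structure is that `teachers` is just a lookup table: a new specialist can be added without touching the others, which matches the modularity the article describes.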

Why it matters: on-policy distillation is becoming the connective tissue of modern model-building -- the mechanism that transfers expensive RL-earned skills into cheaper, deployable models and lets labs assemble a generalist from a roster of specialists. DOPD makes that transfer more honest, and MOPD makes it modular, so different teams can own different capabilities and merge them cleanly. Together they read like infrastructure for an assembly-line approach to frontier post-training.

The honest caveat: both are new results with strong but self-reported gains, evaluated on specific model families, and the deployment claims -- while notable -- come from the same teams proposing the methods. Distillation research also tends to show its best numbers on the benchmarks the authors chose; independent replication across more model sizes is what would confirm these become standard practice rather than clever one-offs. The direction, though, is clearly where post-training is heading. Follow the model-training beat at Ground Truth.

Primary source: arXiv:2606.30626 (DOPD)

Key questions

What is on-policy distillation?

It is training a smaller student model on its own generated outputs, with a stronger teacher grading each step, so the student learns from mistakes it actually makes rather than from a teacher's off-distribution examples.

What problem does DOPD solve?

DOPD fixes the 'privilege illusion,' in which a teacher with access to hints the student will never have teaches the student to imitate behavior it cannot actually reproduce. It does so by choosing, token by token, whether to trust the teacher or the student.

What does MOPD do differently?

MOPD trains several specialist reinforcement-learning teachers independently -- one for math, one for coding, and so on -- then distills all of them into a single student on its own rollouts, inheriting nearly all their skills without the usual multi-domain interference.

Topics: distillation · rl-post-training · llm-training · on-policy · capability-integration
