on-policy

Everything on Ground Truth tagged “on-policy” — 2 items.

Two new papers push 'on-policy distillation' to fix privileged teachers and merge specialist skills (News)

DOPD and MOPD both advance on-policy distillation, in which a student is trained on its own outputs. DOPD routes supervision to avoid a 'privilege illusion', while MOPD merges multiple specialist RL teachers into one model without cross-domain interference.

On-Policy vs Off-Policy Learning (Lesson)

On-policy learning trains a model on data generated by its own current behavior. Off-policy learning trains it on data generated by something else, such as an older version of the model, a different policy, or a fixed dataset. The choice shapes how stable, sample-efficient, and reliable the training is.
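As a rough illustration of the distinction, the sketch below estimates the expected reward of a policy on a toy two-action problem, once from data the policy generated itself (on-policy) and once from data generated by a different policy, corrected with importance weights (off-policy). The reward probabilities, the function names, and the choice of importance weighting are assumptions made for this example; they are not taken from the lesson or the papers listed above.

```python
import random

# Hypothetical problem: two actions with fixed success probabilities.
REWARD_PROB = [0.2, 0.8]

def collect(policy, n, rng):
    """Generate n (action, reward) pairs by acting with `policy`."""
    data = []
    for _ in range(n):
        # Sample an action index according to the policy's probabilities.
        action = rng.choices(range(len(policy)), weights=policy, k=1)[0]
        reward = 1.0 if rng.random() < REWARD_PROB[action] else 0.0
        data.append((action, reward))
    return data

def estimate_value(data, target, behavior):
    """Importance-weighted estimate of the target policy's expected reward.

    `behavior` is the policy that produced `data`. When it equals `target`,
    every weight is 1 and this reduces to a plain on-policy average.
    """
    total = sum(target[a] / behavior[a] * r for a, r in data)
    return total / len(data)

if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.3, 0.7]    # the policy we want to evaluate
    behavior = [0.5, 0.5]  # a different policy that produced older data

    on_policy_data = collect(target, 20000, rng)
    off_policy_data = collect(behavior, 20000, rng)

    print("on-policy estimate: ", estimate_value(on_policy_data, target, target))
    print("off-policy estimate:", estimate_value(off_policy_data, target, behavior))
```

Both estimates aim at the same quantity, but the off-policy one reuses data the target policy never generated and pays for it with higher variance from the weights. That is one simple instance of the stability versus sample-efficiency trade-off.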