On-policy distillation closes the training gap between student and teacher

On-policy distillation, a technique gaining traction in post-training pipelines for smaller models, trains a student model on trajectories sampled from its own current policy while a teacher model provides dense token-level supervision through KL-based regularization. Papers With Code published a methods overview on May 26 consolidating the canonical formulation, which unifies forward-KL, reverse-KL, and Jensen-Shannon Divergence losses inside a single objective.
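One way to write such a unified objective is sketched below. The notation is an assumption for illustration and is not taken from the methods page: \pi_\theta is the student, \pi_T is the teacher, x is a prompt, y is a completion sampled from the student, and D is whichever of the three divergences is selected.

```latex
% On-policy distillation objective (illustrative notation, assumed here):
% completions y are sampled from the student \pi_\theta, and the teacher \pi_T
% is compared with the student at every token position t.
\mathcal{L}(\theta)
  = \mathbb{E}_{x}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D\big( \pi_T(\cdot \mid x, y_{<t}),\; \pi_\theta(\cdot \mid x, y_{<t}) \big)
    \right]

% The three choices of D for a teacher distribution p and a student distribution q:
D_{\mathrm{FKL}}(p, q) = \mathrm{KL}(p \,\|\, q), \qquad
D_{\mathrm{RKL}}(p, q) = \mathrm{KL}(q \,\|\, p), \qquad
D_{\mathrm{JSD}}(p, q) = \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, m),
\quad m = \tfrac{1}{2}(p + q)
```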

The core problem this addresses is the train-inference distribution mismatch that off-policy methods suffer from. In conventional knowledge distillation, the student learns from outputs the teacher produced. At inference time, however, the student conditions on its own previously generated tokens rather than on teacher-shaped trajectories, which means it is being evaluated on a distribution it never trained on. On-policy distillation removes that gap by sampling trajectories from the student itself and using the teacher only as a per-token guide.
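The sampling-then-scoring pattern described above can be sketched with toy models. Everything in this sketch is hypothetical: `student_probs` and `teacher_probs` are stand-in functions that return fixed next-token distributions over a three-token vocabulary, and reverse-KL is used as the per-token signal only as one possible choice.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]  # toy vocabulary (assumption for illustration)

def student_probs(prefix):
    # Hypothetical student: next-token distribution depends only on prefix length.
    return [0.6, 0.3, 0.1] if len(prefix) % 2 == 0 else [0.2, 0.5, 0.3]

def teacher_probs(prefix):
    # Hypothetical teacher: same interface, different preferences.
    return [0.7, 0.2, 0.1] if len(prefix) % 2 == 0 else [0.1, 0.6, 0.3]

def sample_on_policy(student, max_len, rng):
    """Sample a trajectory from the student's own policy (the on-policy step)."""
    prefix = []
    for _ in range(max_len):
        probs = student(prefix)
        token = rng.choices(range(len(VOCAB)), weights=probs)[0]
        prefix.append(token)
        if VOCAB[token] == "<eos>":
            break
    return prefix

def per_token_reverse_kl(student, teacher, trajectory):
    """Score every position of a student-sampled trajectory against the teacher."""
    losses = []
    for t in range(len(trajectory)):
        prefix = trajectory[:t]  # the student's own prefix, not a teacher-written one
        s, q = student(prefix), teacher(prefix)
        losses.append(sum(si * math.log(si / qi) for si, qi in zip(s, q) if si > 0))
    return losses

rng = random.Random(0)
trajectory = sample_on_policy(student_probs, max_len=8, rng=rng)
losses = per_token_reverse_kl(student_probs, teacher_probs, trajectory)
```

In a real training loop the per-token losses would be averaged and backpropagated through the student, which this toy version omits. The point it illustrates is that the teacher is only ever queried on prefixes the student actually produced.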

Among the three loss formulations the methods page covers, reverse-KL, the mode-seeking option, has emerged as the default for smaller students. Forward-KL produces mean-seeking (mass-covering) behavior that spreads the student’s output distribution over every region where the teacher might put mass, which leaves the student less decisive. Reverse-KL incentivizes the student to concentrate mass on the regions the teacher prefers, producing sharper, more committed outputs. JSD sits between the two and is useful when neither extreme matches the deployment target.
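The difference between the three losses can be seen on a single next-token distribution. The sketch below is a minimal illustration with made-up numbers: a bimodal teacher, a "peaked" student that commits to one mode, and a "spread" student that covers everything. The function names are chosen here for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(teacher, student):
    # Expectation under the teacher: penalizes missing any teacher mode.
    return kl(teacher, student)

def reverse_kl(teacher, student):
    # Expectation under the student: penalizes mass where the teacher has little.
    return kl(student, teacher)

def jsd(teacher, student):
    # Symmetric compromise, measured against the 50/50 mixture.
    mix = [0.5 * (t + s) for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)

teacher = [0.49, 0.02, 0.49]   # two equally good answers, one poor one
peaked = [0.96, 0.02, 0.02]    # commits to a single teacher mode
spread = [0.34, 0.32, 0.34]    # covers everything, including the poor answer

# Forward-KL ranks the spread student as closer to the teacher,
# while reverse-KL ranks the peaked student as closer.
prefers_spread_under_fkl = forward_kl(teacher, spread) < forward_kl(teacher, peaked)
prefers_peaked_under_rkl = reverse_kl(teacher, peaked) < reverse_kl(teacher, spread)
```

With these particular numbers, both comparisons come out true, which matches the mass-covering versus mode-seeking description above. Other distributions can give smaller or larger gaps.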

The structural appeal of on-policy distillation for production teams is that it composes cleanly with existing RL stacks like Tinker. The methods page notes that swapping the regularizer model on top of an RL training loop is approximately a one-line code change. That framing slightly oversells the practical migration: making the training loop work reliably requires choosing learning rates, KL coefficients, and sampling temperatures that interact in ways that take real engineering time to tune. But the architectural change is small.

The technique is not new. KL-based regularization between student and teacher distributions has been a standard RL fine-tuning move since at least 2022, and on-policy variants have appeared regularly in academic papers since then. What is new is the consolidation into a clean canonical formulation that lets teams adopt the technique without re-deriving the math each time. The Papers With Code entry is a reference document, not a research breakthrough, and that is precisely the point. The technique has reached the maturity stage where it can be specified and applied at scale.

For teams training their own smaller models, on-policy distillation is the right default when the goal is to inherit specific behaviors from a larger teacher (instruction-following style, tool-use patterns, refusal calibration) without paying the inference cost of the larger model in production. The technique does not help when the goal is to acquire new capabilities the teacher does not have, since the teacher cannot supervise behavior it cannot itself produce. The fit is for behavior transfer, not capability transfer.

Teams considering an in-house post-training program should treat reverse-KL on-policy distillation as the starting recipe and tune from there, rather than starting from supervised fine-tuning on synthetic teacher outputs as the default. The cost in training engineering is roughly the same, and the resulting model is materially more consistent at deployment.

Published on Papers With Code on 2026-05-26.
