On-Policy Distillation - Ada Lovelemon

Current Dilemma for LLM#

Currently, large models are post-trained via RLHF, which makes them powerful but expensive to train and deploy. Smaller models are usually fine-tuned with SFT or knowledge distillation (KD); they are easier to deploy and adapt but often fall short of larger models in performance.

Notably, a small model that is sufficiently trained on a specific domain often outperforms larger, general-purpose models within that domain. Smaller models also bring practical benefits:

- Edge Deployment: Smaller models can run locally on edge devices with limited computational resources, enabling real-time inference and reducing latency.
- Faster Updates: Smaller models can be updated and fine-tuned more quickly, allowing faster iteration cycles and adaptation to new data or tasks.
- Information Security: Deploying smaller models locally minimizes the need to transmit sensitive information to external servers, which reduces the risk of data leakage and improves privacy.

Thus, the central question is how to combine these advantages and build a smaller model that is both powerful and efficient to deploy.

What is On‑Policy Distillation?#

In a nutshell, on-policy distillation samples trajectories from the student’s own distribution and has the teacher label every state along them. Because the student is supervised on states it actually visits, its decisions are aligned with the distribution it induces, which reduces distribution shift and improves robustness.

Why on-policy?#

Traditional RLHF/RLAIF assigns a scalar reward to each student rollout based on human or model preferences, rather than step-by-step corrections. In contrast, on-policy distillation has the student mimic the teacher’s behavior on the student’s own sampled trajectories, so the student learns the correct behavior more directly.

Consider a scenario where the student must complete a task independently. The student might make a mistake early and ultimately give a wrong answer. With off-policy training, we simply force the student to reproduce the teacher’s output, so it never learns why its own response was wrong or where it first went off track. In contrast, on-policy training lets the teacher point out mistakes along the student’s own reasoning path and guide the student to correct them step by step. This way, the student learns faster and grasps the correct reasoning process.

Why distillation?#

Distillation is contrasted here with reinforcement learning, whose main drawback is very sparse feedback. In RL, no matter how long or detailed the model’s output is (i.e., how many tokens it generates), it receives only one scalar reward for the entire output. This makes the learning signal sparse and coarse compared to methods like supervised learning or distillation, which provide per-token guidance.

To better understand the sparsity of RL, imagine you do your homework and the next day the teacher only gives you a boolean score (tick or cross) at the end, without any hints or corrections along the way. To improve, you’d have to redo the entire assignment and guess which part was wrong—a frustrating and inefficient process. If you instead get feedback on each question or step, you can learn and improve much faster.

This is the key advantage of distillation: it provides dense, token-level, informative feedback at every step, allowing the student to learn more effectively. The student model will try to match the output distribution of a teacher model. This dense supervision not only accelerates learning but also enables smaller models to acquire nuanced reasoning patterns that would be nearly impossible to learn from sparse rewards alone.
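The contrast between the two kinds of feedback can be illustrated with a small sketch. The function names and the log-probability values below are hypothetical and chosen only for illustration; the dense signal is taken to be the teacher's log-probability minus the student's on each sampled token, which is one simple way to realize per-token guidance.

```python
def sparse_reward(num_tokens, is_correct):
    """Outcome-style RL signal: a single scalar for the whole rollout,
    placed on the last token; every other position carries no information."""
    return [0.0] * (num_tokens - 1) + [1.0 if is_correct else 0.0]


def dense_signal(student_logprobs, teacher_logprobs):
    """Distillation-style signal: teacher minus student log-prob per token.
    Strongly negative entries mark tokens the teacher finds unlikely."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# Made-up log-probabilities of a 5-token rollout sampled by the student.
student_lp = [-0.2, -0.3, -0.1, -0.4, -0.2]
teacher_lp = [-0.3, -0.2, -4.0, -0.5, -0.3]

sparse = sparse_reward(len(student_lp), is_correct=False)
dense = dense_signal(student_lp, teacher_lp)

# The dense signal localizes the step where the student went off track.
worst_step = min(range(len(dense)), key=dense.__getitem__)
print("sparse:", sparse)
print("dense: ", [round(d, 2) for d in dense])
print("most penalized token index:", worst_step)
```

With the sparse signal, the failed rollout only says that something went wrong; the dense signal additionally points at the third token (index 2) as the place where teacher and student disagree most.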

Method	Sampling	Reward signal
SFT	off-policy	dense
RL	on-policy	sparse
OPD	on-policy	dense

Unlike DPO—which still relies on preference pairs and implicit reward assumptions—OPD requires only a fixed teacher and standard supervised learning.

Formal objective#

Let $\pi_\theta$ denote the student policy parameterized by $\theta$ and $\pi_0$ the (fixed) teacher policy. The canonical on‑policy distillation objective is

$$
\text{RKLD}(\theta) = \mathbb{E}_{\mathbf{x} \sim \pi_\theta} \left[ \sum_{t=1}^{T-1} \Big( \log \pi_\theta(x_{t+1} \mid x_{\leq t}) - \log \pi_0(x_{t+1} \mid x_{\leq t}) \Big) \right]
$$

Here the loss is the per-token reverse KL divergence (RKLD): at each position, student and teacher are compared on the same student-generated prefix. As the student’s next-token distribution approaches the teacher’s on these prefixes, the loss approaches zero.
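To see why the loss vanishes only when the student matches the teacher, the expectation over the sampled next token can be taken first. This is a short worked step under the definitions above; $D_{\mathrm{KL}}$ and the vocabulary index $v$ are notation introduced here, not taken from the original post.

$$
\text{RKLD}(\theta) = \mathbb{E}_{\mathbf{x} \sim \pi_\theta} \left[ \sum_{t=1}^{T-1} D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x_{\leq t}) \,\|\, \pi_0(\cdot \mid x_{\leq t}) \big) \right], \qquad D_{\mathrm{KL}}(p \,\|\, q) = \sum_{v} p(v) \log \frac{p(v)}{q(v)} \geq 0
$$

Each term is non-negative and equals zero exactly when $\pi_\theta(\cdot \mid x_{\leq t}) = \pi_0(\cdot \mid x_{\leq t})$, so the objective reaches zero only if the student agrees with the teacher on every prefix the student itself generates.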

Workflow#

OPD operates as an iterative loop: the student continuously generates new trajectories, which are immediately labeled by the teacher and used to update itself—enabling rapid adaptation to its own evolving behavior.
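The loop can be sketched on a toy problem. The following is a minimal illustration, not the original post's implementation: it assumes a 3-token vocabulary in which the "state" is just the previous token, a hand-written teacher table, and plain gradient descent on the exact per-state reverse KL (rather than a sampled estimate). All names and numbers are hypothetical.

```python
import math
import random

# Toy setup (illustrative assumptions): 3-token vocabulary, state = previous token.
VOCAB_SIZE = 3
SEQ_LEN = 5
TEACHER_PROBS = [  # fixed teacher: next-token distribution for each state
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(p, q):
    """KL(p || q) with p the student and q the teacher distribution."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def sample_student_states(student_logits, rng):
    """Step 1: the student generates a trajectory; record the states it visits."""
    state, visited = 0, []
    for _ in range(SEQ_LEN):
        visited.append(state)
        probs = softmax(student_logits[state])
        state = rng.choices(range(VOCAB_SIZE), weights=probs)[0]
    return visited


def distill_step(student_logits, visited, lr):
    """Steps 2 and 3: the teacher labels each visited state, then the student
    takes a gradient step on the reverse KL at that state."""
    for s in visited:
        p = softmax(student_logits[s])
        q = TEACHER_PROBS[s]
        kl = reverse_kl(p, q)
        for i in range(VOCAB_SIZE):
            # d KL(softmax(z) || q) / d z_i = p_i * (log(p_i / q_i) - KL)
            grad = p[i] * (math.log(p[i]) - math.log(q[i]) - kl)
            student_logits[s][i] -= lr * grad


def train(steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    student_logits = [[0.0] * VOCAB_SIZE for _ in range(VOCAB_SIZE)]
    for _ in range(steps):
        visited = sample_student_states(student_logits, rng)
        distill_step(student_logits, visited, lr)
    return student_logits


student = train()
final_kls = [reverse_kl(softmax(student[s]), TEACHER_PROBS[s])
             for s in range(VOCAB_SIZE)]
print("per-state reverse KL after training:", final_kls)
```

Because the trajectories are resampled from the current student at every step, the states that receive supervision shift as the student changes, which is the on-policy part of the loop. In a real setting the table lookups would be replaced by forward passes of the student and teacher language models.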

Experiments#

The original blog post reports the following experiments:

1. Math Reasoning (AIME’24 Benchmark)#

The starting point is a Qwen3-8B model first fine-tuned on 400K teacher-generated samples (~60% AIME’24 accuracy). Three ways of continuing training are compared:
- Continued SFT (off-policy distillation): Requires ~2M samples to improve from 60% to 70% accuracy.
- RL (e.g., as in Qwen3 Technical Report): Achieves 67.6% at a cost of 17,920 GPU-hours.
- On-Policy Distillation: Reaches >70% accuracy in just ~150 training steps (~77K samples), costing only ~1,800 GPU-hours, roughly a tenth of the RL figure above and consistent with the quoted 9–30× compute saving.
The efficiency gain stems from avoiding reward modeling and policy gradient estimation—OPD uses only standard supervised learning on teacher-labeled student outputs.
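As a quick sanity check, the ratios implied by the figures listed above can be recomputed directly. The variable names are illustrative, and the numbers are copied from the list rather than measured independently.

```python
# Figures as quoted in the list above (approximate values from the text).
rl_gpu_hours = 17_920
opd_gpu_hours = 1_800
sft_samples = 2_000_000   # continued SFT needed to go from 60% to 70%
opd_samples = 77_000      # on-policy distillation samples to pass 70%

gpu_hour_ratio = rl_gpu_hours / opd_gpu_hours
sample_ratio = sft_samples / opd_samples

print(f"RL vs OPD GPU-hours: {gpu_hour_ratio:.2f}x")
print(f"SFT vs OPD samples:  {sample_ratio:.1f}x")
```

Both ratios fall inside the quoted 9–30× range, although they measure different things (GPU-hours against RL, training samples against continued SFT).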

2. Personalization & Continual Learning (Internal Assistant)#

After domain fine-tuning on internal docs, the model suffers catastrophic forgetting: instruction-following score (IF-eval) drops from 85% to 45–79%.
Applying on-policy distillation on chat data, with the original Qwen3-8B as teacher:
- recovers IF-eval to 83%,
- preserves domain knowledge (internal QA stays at ~41%).
This demonstrates a strong capability for continual learning without forgetting.

3. Single-Example Generalization#

Trained on only one AIME problem (5,120 sampled trajectories), the student achieves ~50% zero-shot accuracy on the full AIME’24 test set, demonstrating strong out-of-distribution generalization.
Even with this single prompt, on-policy distillation approaches the teacher’s overall AIME’24 performance, which suggests that it learns general reasoning strategies rather than memorizing one answer.

Summary#

On-Policy Distillation delivers superior performance, sample efficiency, and continual learning ability compared to SFT and RL. By providing dense, token-level supervision on the student’s own trajectories, it enables small models to closely mimic expert teachers—at a fraction of the cost.

This suggests that with the right training paradigm, small models can match or exceed large models in their domains—making scale less critical than alignment and feedback quality.

REFERENCES#

Lu, K., et al. “On-Policy Distillation.” Thinking Machines Lab, 2025. https://thinkingmachines.ai/blog/on-policy-distillation/.