DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models
Abstract
DiffusionOPD enables efficient multi-task training for diffusion models through on-policy distillation, outperforming existing reinforcement learning approaches in both training efficiency and final performance.
Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on On-Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality than conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
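To make the "per-step KL via mean-matching" idea concrete, here is a minimal sketch rather than the paper's implementation. It assumes each denoising step is an isotropic Gaussian transition whose variance is shared by student and teacher, in which case the KL divergence has the standard closed form ||mu_student - mu_teacher||^2 / (2 sigma^2). The linear "denoisers", the weights, and the noise schedule are hypothetical toy stand-ins, and only the stochastic (SDE-style) case is shown.

```python
import numpy as np

def per_step_kl(mu_student, mu_teacher, sigma):
    """Closed-form KL(N(mu_student, sigma^2 I) || N(mu_teacher, sigma^2 I)).

    With a shared isotropic variance, the KL reduces to a scaled squared
    distance between the two means (mean-matching).
    """
    diff = np.asarray(mu_student, dtype=float) - np.asarray(mu_teacher, dtype=float)
    return float(np.sum(diff ** 2) / (2.0 * sigma ** 2))

def toy_mean(x, weight):
    # Hypothetical stand-in for a denoiser's predicted transition mean.
    return weight * x

def opd_loss(x0, w_student, w_teacher, sigmas, rng):
    """Sum per-step KLs along a trajectory sampled from the student."""
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for sigma in sigmas:
        mu_s = toy_mean(x, w_student)
        mu_t = toy_mean(x, w_teacher)  # teacher is queried at the student's state
        total += per_step_kl(mu_s, mu_t, sigma)
        # On-policy: the next state is drawn from the student's own transition.
        x = mu_s + sigma * rng.standard_normal(x.shape)
    return total

rng = np.random.default_rng(0)
loss = opd_loss(np.ones(4), w_student=0.9, w_teacher=0.8,
                sigmas=[1.0, 0.5, 0.25], rng=rng)
print(f"toy OPD loss: {loss:.4f}")
```

In a multi-task setting one would presumably select the teacher according to the task of each prompt. Because the per-step term is an analytic function of the two means, no sampled reward or advantage estimate enters the loss in this sketch.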
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Flow-OPD: On-Policy Distillation for Flow Matching Models (2026)
- $R_\text{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation (2026)
- V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think (2026)
- Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes (2026)
- Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning (2026)
- UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models (2026)
- The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes (2026)
Get this paper in your agent:
hf papers read 2605.15055
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash