Bibliographic Details
Main Authors: Ye, Haotian, Zheng, Kaiwen, Xu, Jiashu, Li, Puheng, Chen, Huayu, Han, Jiaqi, Liu, Sheng, Zhang, Qinsheng, Mao, Hanzi, Hao, Zekun, Chattopadhyay, Prithvijit, Yang, Dinghao, Feng, Liang, Liao, Maosheng, Bai, Junjie, Liu, Ming-Yu, Zou, James, Ermon, Stefano
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2512.04332
Summary:
  • Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are vulnerable to reward hacking, which shows up as quality degradation, over-stylization, or reduced diversity. Our analysis attributes this to an inherent limitation of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.
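
The summary's step from "forward KL to a data distribution" to "diffusion loss minimization" can be sketched as follows. This is only an illustration under assumed notation, not the paper's exact objective: the symbols $p_\theta$ (policy), $p_{\mathrm{data}}$ (data distribution), $r$ (reward), $\beta$ (regularization weight), $w(t)$ (time weighting), and $\epsilon_\theta$ (noise predictor) are all introduced here and do not appear in the summary.

```latex
% Assumed form of a forward-KL-regularized objective (beta > 0 is hypothetical):
\[
  J(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\big[ r(x) \big]
  \;-\; \beta \, D_{\mathrm{KL}}\big( p_{\mathrm{data}} \,\|\, p_\theta \big)
\]

% The forward KL splits into a theta-independent term and a cross-entropy term:
\[
  D_{\mathrm{KL}}\big( p_{\mathrm{data}} \,\|\, p_\theta \big)
  \;=\; \underbrace{\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[ \log p_{\mathrm{data}}(x) \big]}_{\text{independent of } \theta}
  \;-\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[ \log p_\theta(x) \big]
\]

% For a diffusion model, the negative log-likelihood is bounded (for a suitable
% weighting w(t), up to a constant C) by the usual denoising loss on data samples:
\[
  -\,\mathbb{E}_{x_0 \sim p_{\mathrm{data}}}\big[ \log p_\theta(x_0) \big]
  \;\le\; \underbrace{\mathbb{E}_{x_0 \sim p_{\mathrm{data}},\, t,\, \epsilon}
  \Big[ w(t)\, \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2 \Big]}_{\mathcal{L}_{\mathrm{diff}}(\theta)} \;+\; C
\]

% So maximizing J is approximated by minimizing a reward term plus a diffusion loss:
\[
  \min_\theta \;\; -\,\mathbb{E}_{x \sim p_\theta}\big[ r(x) \big]
  \;+\; \beta \, \mathcal{L}_{\mathrm{diff}}(\theta)
\]
```

Under these assumptions, the regularizer is evaluated on samples from the data rather than on samples from the policy, which is consistent with the summary's description of anchoring to an off-policy data distribution. The weighting, estimator, and any further terms used in the paper itself are not stated in the summary.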