Bibliographic Details
Main Authors: Ye, Haotian; Zheng, Kaiwen; Xu, Jiashu; Li, Puheng; Chen, Huayu; Han, Jiaqi; Liu, Sheng; Zhang, Qinsheng; Mao, Hanzi; Hao, Zekun; Chattopadhyay, Prithvijit; Yang, Dinghao; Feng, Liang; Liao, Maosheng; Bai, Junjie; Liu, Ming-Yu; Zou, James; Ermon, Stefano
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2512.04332
_version_ 1866912787394985984
author Ye, Haotian
Zheng, Kaiwen
Xu, Jiashu
Li, Puheng
Chen, Huayu
Han, Jiaqi
Liu, Sheng
Zhang, Qinsheng
Mao, Hanzi
Hao, Zekun
Chattopadhyay, Prithvijit
Yang, Dinghao
Feng, Liang
Liao, Maosheng
Bai, Junjie
Liu, Ming-Yu
Zou, James
Ermon, Stefano
contents Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are vulnerable to reward hacking, which manifests as quality degradation, over-stylization, or reduced diversity. Our analysis attributes this to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.
format Preprint
id arxiv_https___arxiv_org_abs_2512_04332
institution arXiv
publishDate 2025
record_format arxiv
title Data-regularized Reinforcement Learning for Diffusion Models at Scale
topic Machine Learning
url https://arxiv.org/abs/2512.04332