Bibliographic Details
Main Authors: Pham, Khiem, Nguyen, Quang, Nguyen, Tung, Zhu, Jingsen, Santacatterina, Michele, Metaxas, Dimitris, Zabih, Ramin
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.06195
author Pham, Khiem
Nguyen, Quang
Nguyen, Tung
Zhu, Jingsen
Santacatterina, Michele
Metaxas, Dimitris
Zabih, Ramin
contents Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework that augments limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.
format Preprint
id arxiv_https___arxiv_org_abs_2602_06195
institution arXiv
publishDate 2026
record_format arxiv
title DeDPO: Debiased Direct Preference Optimization for Diffusion Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.06195