Staff View: When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF :: Library Catalog

Bibliographic Details
Main Authors:	Xu, Yifan; Ye, Xichen; Chen, Yifan; Zhang, Qiaosheng
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.00709

_version_	1866914174863409152
author	Xu, Yifan; Ye, Xichen; Chen, Yifan; Zhang, Qiaosheng
author_facet	Xu, Yifan; Ye, Xichen; Chen, Yifan; Zhang, Qiaosheng
contents	Dataset quality plays an important role in large language model (LLM) alignment. When collecting human feedback, however, preference flipping is ubiquitous and corrupts the data annotation; this calls for alignment algorithms with improved robustness against potentially flipped pairs. To this end, this paper introduces a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning from human feedback (RLHF) perspective. We treat the inherent human intention model and the preference flipping mechanism introduced by external factors as two distinct stages; in the latter, we introduce an instance-dependent flipping probability on the basis of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and DPO algorithms. In our experiments, we study the instance-dependent preference flipping model under multiple settings to evaluate our proposed method alongside baseline methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_00709
institution	arXiv
publishDate	2025
record_format	arxiv
title	When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF
topic	Artificial Intelligence
url	https://arxiv.org/abs/2512.00709

Similar Items