Staff View: Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Jiahao; Zhang, Lujing; Grimes, Keltin; Yu, Zhuohao; Swamy, Gokul; Wu, Zhiwei Steven
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.19041

_version_	1866913091671818240
author	Zhang, Jiahao; Zhang, Lujing; Grimes, Keltin; Yu, Zhuohao; Swamy, Gokul; Wu, Zhiwei Steven
author_facet	Zhang, Jiahao; Zhang, Lujing; Grimes, Keltin; Yu, Zhuohao; Swamy, Gokul; Wu, Zhiwei Steven
contents	A recurring challenge in preference fine-tuning (PFT) is handling $\textit{intransitive}$ (i.e., cyclic) preferences. Intransitive preferences often stem from either $\textit{(i)}$ inconsistent rankings along a single objective or $\textit{(ii)}$ scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel, game-theoretic solution concept, the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$), that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive $\texttt{PROSPER}$: a provably efficient PFT algorithm. Unlike prior self-play techniques, $\texttt{PROSPER}$ directly handles multiple objectives without requiring scalarization. We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language models (LLMs) from multi-objective LLM-as-a-Judge feedback (e.g., rubric-based judges), a setting where both sources of intransitivity arise. We find that $\texttt{PROSPER}$ outperforms all baselines considered across both instruction following and general chat benchmarks, releasing trained model checkpoints at the 7B and 3B parameter scales.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_19041
institution	arXiv
publishDate	2026
record_format	arxiv
title	Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning
topic	Machine Learning
url	https://arxiv.org/abs/2602.19041

Similar Items