| Main Authors: | Zhang, Jiahao; Zhang, Lujing; Grimes, Keltin; Yu, Zhuohao; Swamy, Gokul; Wu, Zhiwei Steven |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2602.19041 |
| _version_ | 1866913091671818240 |
|---|---|
| author | Zhang, Jiahao; Zhang, Lujing; Grimes, Keltin; Yu, Zhuohao; Swamy, Gokul; Wu, Zhiwei Steven |
| author_facet | Zhang, Jiahao; Zhang, Lujing; Grimes, Keltin; Yu, Zhuohao; Swamy, Gokul; Wu, Zhiwei Steven |
| contents | A recurring challenge in preference fine-tuning (PFT) is handling $\textit{intransitive}$ (i.e., cyclic) preferences. Intransitive preferences often stem from either $\textit{(i)}$ inconsistent rankings along a single objective or $\textit{(ii)}$ scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel, game-theoretic solution concept, the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$), that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive $\texttt{PROSPER}$: a provably efficient PFT algorithm. Unlike prior self-play techniques, $\texttt{PROSPER}$ directly handles multiple objectives without requiring scalarization. We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language models (LLMs) from multi-objective LLM-as-a-Judge feedback (e.g., rubric-based judges), a setting where both sources of intransitivity arise. We find that $\texttt{PROSPER}$ outperforms all baselines considered across both instruction following and general chat benchmarks, releasing trained model checkpoints at the 7B and 3B parameter scales. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_19041 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2602.19041 |