Staff View: General Preference Reinforcement Learning :: Library Catalog

Bibliographic Details
Main Authors:	Umer, Muhammad; Mohsin, Muhammad Ahmed; Bilal, Ahsan; Chaudhry, Arslan; Haupt, Andreas; Koyejo, Sanmi; Fox, Emily; Cioffi, John M.
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2605.18721

_version_	1866913151905169408
author	Umer, Muhammad; Mohsin, Muhammad Ahmed; Bilal, Ahsan; Chaudhry, Arslan; Haupt, Andreas; Koyejo, Sanmi; Fox, Emily; Cioffi, John M.
author_facet	Umer, Muhammad; Mohsin, Muhammad Ahmed; Bilal, Ahsan; Chaudhry, Arslan; Haupt, Andreas; Koyejo, Sanmi; Fox, Emily; Cioffi, John M.
contents	Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into k skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the k-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_18721
institution	arXiv
publishDate	2026
record_format	arxiv
title	General Preference Reinforcement Learning
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2605.18721