Bibliographic Details
Main Authors: Umer, Muhammad, Mohsin, Muhammad Ahmed, Bilal, Ahsan, Chaudhry, Arslan, Haupt, Andreas, Koyejo, Sanmi, Fox, Emily, Cioffi, John M.
Format: Preprint
Published: 2026
Subjects: Machine Learning, Computation and Language
Online Access: https://arxiv.org/abs/2605.18721
author Umer, Muhammad
Mohsin, Muhammad Ahmed
Bilal, Ahsan
Chaudhry, Arslan
Haupt, Andreas
Koyejo, Sanmi
Fox, Emily
Cioffi, John M.
contents Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
format Preprint
id arxiv_https___arxiv_org_abs_2605_18721
institution arXiv
publishDate 2026
record_format arxiv
title General Preference Reinforcement Learning
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2605.18721
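
The abstract's advantage computation (per-dimension group-relative advantages, each normalized on its own scale, then aggregated with eigenvalues) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes each of the $k$ subspaces is two-dimensional with the skew-symmetric block [[0, -1], [1, 0]], that a response's per-dimension raw advantage is its mean score against the other responses in its group, and that the eigenvalues are positive and supplied externally. The function names `per_dim_scores` and `gprl_style_advantages` are hypothetical.

```python
import numpy as np


def per_dim_scores(v):
    """Pairwise skew-symmetric scores per subspace (illustrative assumption).

    v: array of shape (G, k, 2), one 2-D embedding per response and subspace.
    Returns s of shape (G, G, k) with s[i, j, l] = v_i^T R v_j in subspace l,
    where R = [[0, -1], [1, 0]], so s[i, j, l] = -s[j, i, l].
    """
    a, b = v[..., 0], v[..., 1]  # each of shape (G, k)
    return b[:, None, :] * a[None, :, :] - a[:, None, :] * b[None, :, :]


def gprl_style_advantages(v, eigvals, eps=1e-8):
    """Sketch of per-dimension group-relative advantages with eigenvalue weights.

    v: (G, k, 2) embeddings for a group of G sampled responses to one prompt.
    eigvals: (k,) positive weights (assumed to come from a context-dependent gate).
    Returns (adv, normed): adv has shape (G,), normed has shape (G, k).
    """
    G = v.shape[0]
    s = per_dim_scores(v)                      # (G, G, k)
    raw = s.sum(axis=1) / (G - 1)              # mean score vs. the other responses
    centered = raw - raw.mean(axis=0, keepdims=True)
    # Normalize every dimension by its own spread so no axis dominates.
    normed = centered / (raw.std(axis=0, keepdims=True) + eps)
    weights = eigvals / eigvals.sum()          # assumed aggregation rule
    return normed @ weights, normed


# Usage: 4 sampled responses, 3 preference dimensions, random embeddings.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3, 2))
eigvals = np.array([1.0, 2.0, 3.0])
adv, normed = gprl_style_advantages(v, eigvals)
```

Because every column of `normed` has roughly unit spread before weighting, a dimension with large raw scores cannot swamp the others; only the eigenvalue weights decide how much each axis contributes to the scalar advantage.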