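Staff View: Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions :: Library Catalog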

Bibliographic Details
Main Authors:	Matrenok, Simon; Moalla, Skander; Gulcehre, Caglar
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2507.08068

_version_	1866912737272004608
author	Matrenok, Simon; Moalla, Skander; Gulcehre, Caglar
author_facet	Matrenok, Simon; Moalla, Skander; Gulcehre, Caglar
contents	Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently achieves top performance on chat and coding evaluations--reward model scores, AlpacaEval 2, and LeetCode--compared to DPO, REBEL, and SimPO across diverse datasets and 8B-scale models. Finally, we find that training with robust rewards instead of converting them to preferences induces less length bias.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_08068
institution	arXiv
publishDate	2025
record_format	arxiv
title	Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
topic	Machine Learning
url	https://arxiv.org/abs/2507.08068
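
Illustrative note (not part of the catalog record): the abstract states that quantile rewards make the partition function of the KL-regularized objective analytically tractable. The sketch below shows one way this could work, under assumptions the abstract does not spell out: the quantile reward is taken to be the empirical CDF of a completion's reward among reference-policy completions, so it is roughly uniform on [0, 1] under the reference policy, which gives Z = beta * (exp(1/beta) - 1). The function names, the squared-error form of the loss, and the parameter `beta` are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from bisect import bisect_right


def quantile_reward(reward, reference_rewards):
    """Empirical CDF of `reward` among rewards of reference-policy completions.

    Assumption: this is how a "quantile reward" is estimated; more reference
    samples give a finer estimate.
    """
    ordered = sorted(reference_rewards)
    return bisect_right(ordered, reward) / len(ordered)


def log_partition(beta):
    """log Z when the reward is uniform on [0, 1] under the reference policy.

    Z = integral_0^1 exp(u / beta) du = beta * (exp(1 / beta) - 1).
    Written as log(beta) + 1/beta + log(1 - exp(-1/beta)) to avoid overflow
    for small beta.
    """
    inv = 1.0 / beta
    return math.log(beta) + inv + math.log1p(-math.exp(-inv))


def pointwise_regression_loss(q_reward, logp_policy, logp_ref, beta):
    """Squared error between the scaled log-ratio and its closed-form target.

    The KL-regularized optimum satisfies
        beta * log(pi / pi_ref) = reward - beta * log Z,
    so a single (prompt, completion) sample is enough to form a target.
    """
    target = q_reward - beta * log_partition(beta)
    return (target - beta * (logp_policy - logp_ref)) ** 2


if __name__ == "__main__":
    q = quantile_reward(0.5, [0.1, 0.4, 0.9, 1.2])  # two of four rewards are <= 0.5
    loss = pointwise_regression_loss(q, logp_policy=-12.0, logp_ref=-12.5, beta=0.5)
    print(q, loss)
```

Because the target depends only on one completion's quantile reward, no second completion is needed to cancel the partition function, which is the property the abstract contrasts with preference-pair methods.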
