Bibliographic Details
Main Authors: Matrenok, Simon; Moalla, Skander; Gulcehre, Caglar
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2507.08068
author Matrenok, Simon
Moalla, Skander
Gulcehre, Caglar
contents Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently achieves top performance on chat and coding evaluations (reward model scores, AlpacaEval 2, and LeetCode) compared to DPO, REBEL, and SimPO across diverse datasets and 8B-scale models. Finally, we find that training with robust rewards instead of converting them to preferences induces less length bias.
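The abstract's central claim, that a quantile reward makes the partition function tractable, can be illustrated with a small sketch. It is not the authors' implementation. It assumes the quantile reward of a completion is the fraction of reference-policy completions with reward at or below it, so that under the reference policy it is approximately Uniform(0, 1) and Z = beta * (exp(1 / beta) - 1). The function names, the toy Gaussian rewards, and the exact form of the squared loss are illustrative assumptions inferred from the abstract's description of regression to the closed-form solution.

```python
import bisect
import math
import random


def quantile_reward(reward, sorted_ref_rewards):
    """Fraction of reference-policy rewards that are <= `reward` (empirical CDF)."""
    return bisect.bisect_right(sorted_ref_rewards, reward) / len(sorted_ref_rewards)


def log_partition(beta):
    """log Z for a Uniform(0, 1) reward: Z = beta * (exp(1 / beta) - 1)."""
    x = 1.0 / beta
    # Rewritten as log(beta) + x + log(1 - exp(-x)) so small beta does not overflow.
    return math.log(beta) + x + math.log(-math.expm1(-x))


def pointwise_loss(logp_policy, logp_ref, q_reward, beta):
    """Squared gap between the quantile reward and the reward implied by the policy.

    Assumed form: the KL-regularized optimum satisfies
    r = beta * log(pi / pi_ref) + beta * log Z, so the loss is zero there.
    """
    implied = beta * (logp_policy - logp_ref) + beta * log_partition(beta)
    return (q_reward - implied) ** 2


def monte_carlo_partition(beta, n_ref=2000, n_eval=2000, seed=0):
    """Estimate Z by sampling; toy Gaussian rewards stand in for a reward model."""
    rng = random.Random(seed)
    ref = sorted(rng.gauss(0.0, 1.0) for _ in range(n_ref))
    draws = (rng.gauss(0.0, 1.0) for _ in range(n_eval))
    return sum(math.exp(quantile_reward(r, ref) / beta) for r in draws) / n_eval
```

Comparing `monte_carlo_partition(0.5)` with `math.exp(log_partition(0.5))` gives a quick check that the sampled and analytic values agree up to sampling noise. Because Z is known in closed form, the loss needs only one completion per prompt rather than a pair whose difference cancels Z.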
format Preprint
id arxiv_https___arxiv_org_abs_2507_08068
institution arXiv
publishDate 2025
record_format arxiv
title Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
topic Machine Learning
url https://arxiv.org/abs/2507.08068