Bibliographic Details
Main Authors:	Bagirov, Farid; Arkhipov, Mikhail; Sycheva, Ksenia; Glukhov, Evgeniy; Bogomolov, Egor
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.23393
_version_	1866912672582205440
author	Bagirov, Farid; Arkhipov, Mikhail; Sycheva, Ksenia; Glukhov, Evgeniy; Bogomolov, Egor
author_facet	Bagirov, Farid; Arkhipov, Mikhail; Sycheva, Ksenia; Glukhov, Evgeniy; Bogomolov, Egor
contents	The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single-generation problem solving, reinforcement learning fine-tuning may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to off-policy updates, a common element in modern RLVR algorithms that allows better sample efficiency. Empirically, we show that our objective effectively optimizes the max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_23393
institution	arXiv
publishDate	2025
record_format	arxiv
title	The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation
topic	Machine Learning
url	https://arxiv.org/abs/2510.23393
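
The abstract describes max@k as a continuous generalization of pass@k. As a rough illustration only, and not the authors' implementation, the sketch below computes the standard order-statistic estimate of the expected best reward among k generations, given n >= k sampled rewards for one prompt. The function names max_at_k and pass_at_k are hypothetical, and the assumption that the record's max@k matches this estimator is ours.

```python
from math import comb


def max_at_k(rewards, k):
    """Estimate E[max of k generations] from n >= k sampled rewards.

    Averages the maximum over all size-k subsets of the n samples,
    computed in closed form rather than by enumeration.
    (Illustrative assumption: this is the metric the abstract calls max@k.)
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(rewards)")
    ordered = sorted(rewards)
    # The i-th smallest reward (1-indexed) is the maximum of exactly
    # C(i-1, k-1) of the C(n, k) subsets; math.comb returns 0 when i-1 < k-1.
    weighted = sum(comb(i - 1, k - 1) * r for i, r in enumerate(ordered, start=1))
    return weighted / comb(n, k)


def pass_at_k(num_samples, num_correct, k):
    """Familiar pass@k estimate for binary rewards: 1 - C(n-c, k) / C(n, k)."""
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)
```

With binary rewards the two functions agree, for example max_at_k([1, 0, 0, 1, 0], 2) and pass_at_k(5, 2, 2), which is the sense in which max@k generalizes pass@k to real-valued rewards.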