Staff View: ROC-n-reroll: How verifier imperfection affects test-time scaling :: Library Catalog

Bibliographic Details
Main Authors:	Dorner, Florian E.; Chen, Yatong; Cruz, André F.; Yang, Fanny
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2507.12399

_version_	1866915543071588352
author	Dorner, Florian E.; Chen, Yatong; Cruz, André F.; Yang, Fanny
author_facet	Dorner, Florian E.; Chen, Yatong; Cruz, André F.; Yang, Fanny
contents	Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance -- a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_12399
institution	arXiv
publishDate	2025
record_format	arxiv
title	ROC-n-reroll: How verifier imperfection affects test-time scaling
topic	Machine Learning
url	https://arxiv.org/abs/2507.12399

Similar Items