Staff View: ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation :: Library Catalog

Bibliographic Details
Main Authors:	Sun, Hui; Zhang, Yun-Ji; Xie, Zheng; Liu, Ren-Biao; Du, Yali; Li, Xin-Ye; Li, Ming
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2604.03922

_version_	1866911567858106368
author	Sun, Hui; Zhang, Yun-Ji; Xie, Zheng; Liu, Ren-Biao; Du, Yali; Li, Xin-Ye; Li, Ming
author_facet	Sun, Hui; Zhang, Yun-Ji; Xie, Zheng; Liu, Ren-Biao; Du, Yali; Li, Xin-Ye; Li, Ming
contents	Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
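note	The leave-one-out step described in the abstract can be sketched roughly as follows. This is an illustrative reading of the abstract only, not the authors' implementation: the function name loo_auc, the toy pass matrix, the unweighted sum used as the aggregate score, and the 0.5 fallback for a test that every code passes (or every code fails) are all assumptions. The closed-form ACES-C weights and the ACES-O objective are not given in this record and are not reproduced here.

```python
from typing import List


def loo_auc(passes: List[List[int]], held_out: int) -> float:
    """Leave-one-out AUC of one test (hypothetical helper, not from the paper).

    passes[i][j] is 1 if code candidate i passes test j, else 0.
    """
    n_tests = len(passes[0])
    # Aggregate score of each code on all tests except the held-out one
    # (assumption: a plain unweighted count of passed tests).
    scores = [
        sum(row[k] for k in range(n_tests) if k != held_out)
        for row in passes
    ]
    # Split codes by the held-out test's pass/fail pattern.
    pos = [s for row, s in zip(passes, scores) if row[held_out] == 1]
    neg = [s for row, s in zip(passes, scores) if row[held_out] == 0]
    if not pos or not neg:
        # AUC is undefined without both groups; treat as uninformative (assumption).
        return 0.5
    # AUC = fraction of (passing, failing) pairs that the ranking orders
    # consistently with the held-out test; ties count as half.
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy binary pass matrix: 4 code candidates (rows) x 4 tests (columns).
# Test 3 is passed only by the codes that fail most other tests.
P = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]

aucs = [loo_auc(P, j) for j in range(4)]
print(aucs)
```

In this toy matrix, test 3 receives a LOO-AUC of 0.0 because its pass/fail pattern contradicts the ranking induced by the other tests, while tests 0 and 2 receive 0.75. How such scores are turned into test weights is left to the paper itself.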
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_03922
institution	arXiv
publishDate	2026
record_format	arxiv
title	ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
topic	Machine Learning
url	https://arxiv.org/abs/2604.03922

Similar Items