Affichage MARC: :: Library Catalog

Enregistré dans:

Détails bibliographiques
Auteurs principaux:	Liu, Zeli, Zhang, Jiancheng, Liu, Cong, Zhu, Yinglun
Format:	Preprint
Publié:	2026
Sujets:	Artificial Intelligence
Accès en ligne:	https://arxiv.org/abs/2605.10075
Tags:	Ajouter un tag Pas de tags, Soyez le premier à ajouter un tag!

_version_	1866916024112119808
author	Liu, Zeli Zhang, Jiancheng Liu, Cong Zhu, Yinglun
author_facet	Liu, Zeli Zhang, Jiancheng Liu, Cong Zhu, Yinglun
contents	Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_10075
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Active Testing of Large Language Models via Approximate Neyman Allocation Liu, Zeli Zhang, Jiancheng Liu, Cong Zhu, Yinglun Artificial Intelligence Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.
title	Active Testing of Large Language Models via Approximate Neyman Allocation
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.10075

Documents similaires