Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Author:	Lee, Wilson Y.
Format:	Preprint
Published:	2026
Subjects:	Computation and Language Machine Learning
Online Access:	https://arxiv.org/abs/2601.09084
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866917204854833152
author	Lee, Wilson Y.
author_facet	Lee, Wilson Y.
contents	Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt induced variability systematically induce larger margins and improve detectability through a $1.5\times$ reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_09084
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation Lee, Wilson Y. Computation and Language Machine Learning Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt induced variability systematically induce larger margins and improve detectability through a $1.5\times$ reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.
title	How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation
topic	Computation and Language Machine Learning
url	https://arxiv.org/abs/2601.09084

Similar Items