Staff View: Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Longyuan; Hua, Hairan; Miao, Linlin; Zhao, Bing
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.11674

_version_	1866917270147563520
author	Zhu, Longyuan; Hua, Hairan; Miao, Linlin; Zhao, Bing
author_facet	Zhu, Longyuan; Hua, Hairan; Miao, Linlin; Zhao, Bing
contents	Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a purely data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, which measures how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, which estimates the remaining headroom before ceiling effects erode resolution, and thus the benchmark's expected longevity; and (3) Impact, which quantifies influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next-generation evaluation protocols.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_11674
institution	arXiv
publishDate	2026
record_format	arxiv
title	Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.11674
