Staff View: YourBench: Easy Custom Evaluation Sets for Everyone :: Library Catalog

Bibliographic Details
Main Authors:	Shashidhar, Sumuk; Fourrier, Clémentine; Lozovskia, Alina; Wolf, Thomas; Tur, Gokhan; Hakkani-Tür, Dilek
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; I.2.1
Online Access:	https://arxiv.org/abs/2504.01833

_version_	1866909562111524864
author	Shashidhar, Sumuk; Fourrier, Clémentine; Lozovskia, Alina; Wolf, Thomas; Tur, Gokhan; Hakkani-Tür, Dilek
author_facet	Shashidhar, Sumuk; Fourrier, Clémentine; Lozovskia, Alina; Wolf, Thomas; Tur, Gokhan; Hakkani-Tür, Dilek
contents	Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, crucial for real-world applications. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman Rho = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in provided input instead of relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents, published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question answer pairs based on Tempora and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_01833
institution	arXiv
publishDate	2025
record_format	arxiv
title	YourBench: Easy Custom Evaluation Sets for Everyone
topic	Computation and Language; Artificial Intelligence; I.2.1
url	https://arxiv.org/abs/2504.01833
