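Staff View: ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists :: Library Catalog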

Bibliographic Details
Main Authors:	Ruan, Jie; Nair, Inderjeet; Cao, Shuyang; Liu, Amy; Munir, Sheza; Pollens-Dempsey, Micah; Chiang, Tiffany; Kates, Lucy; David, Nicholas; Chen, Sihan; Yang, Ruxin; Yang, Yuqian; Gump, Jasmine; Bialek, Tessa; Sankaran, Vivek; Schlanger, Margo; Wang, Lu
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2506.01241

_version_	1866911194444464128
author	Ruan, Jie; Nair, Inderjeet; Cao, Shuyang; Liu, Amy; Munir, Sheza; Pollens-Dempsey, Micah; Chiang, Tiffany; Kates, Lucy; David, Nicholas; Chen, Sihan; Yang, Ruxin; Yang, Yuqian; Gump, Jasmine; Bialek, Tessa; Sankaran, Vivek; Schlanger, Margo; Wang, Lu
author_facet	Ruan, Jie; Nair, Inderjeet; Cao, Shuyang; Liu, Amy; Munir, Sheza; Pollens-Dempsey, Micah; Chiang, Tiffany; Kates, Lucy; David, Nicholas; Chen, Sihan; Yang, Ruxin; Yang, Yuqian; Gump, Jasmine; Bialek, Tessa; Sankaran, Vivek; Schlanger, Margo; Wang, Lu
contents	This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items of model outputs are then compared with corresponding items of reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 13 popular large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer Gemini-2.5-Pro achieving only a 33.4 F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, but that content is far from correct; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable, reproducible, and low-cost usage.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_01241
institution	arXiv
publishDate	2025
record_format	arxiv
title	ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
topic	Computation and Language
url	https://arxiv.org/abs/2506.01241

Similar Items