Bibliographic Details
Main Authors:	Zhang, Ming; Zhuang, Jiabao; Jing, Wenqing; Tan, Kexin; Kong, Ziyu; Deng, Jingyi; Shen, Yujiong; Wang, Yuhui; Xiang, Zhenghao; Peng, Qiyuan; Zhao, Yuhang; Luo, Ning; Zheng, Renzhe; Lin, Jiahui; Wu, Mingqi; Ma, Long; Dou, Shihan; Pan, Maxm; Gui, Tao; Zhang, Qi; Huang, Xuanjing
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.12369

_version_	1866910235583578112
author	Zhang, Ming; Zhuang, Jiabao; Jing, Wenqing; Tan, Kexin; Kong, Ziyu; Deng, Jingyi; Shen, Yujiong; Wang, Yuhui; Xiang, Zhenghao; Peng, Qiyuan; Zhao, Yuhang; Luo, Ning; Zheng, Renzhe; Lin, Jiahui; Wu, Mingqi; Ma, Long; Dou, Shihan; Pan, Maxm; Gui, Tao; Zhang, Qi; Huang, Xuanjing
author_facet	Zhang, Ming; Zhuang, Jiabao; Jing, Wenqing; Tan, Kexin; Kong, Ziyu; Deng, Jingyi; Shen, Yujiong; Wang, Yuhui; Xiang, Zhenghao; Peng, Qiyuan; Zhao, Yuhang; Luo, Ning; Zheng, Renzhe; Lin, Jiahui; Wu, Mingqi; Ma, Long; Dou, Shihan; Pan, Maxm; Gui, Tao; Zhang, Qi; Huang, Xuanjing
contents	Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent) groups. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference. On the alignment side, all 12 LLMs converge to a Sem-Path of 28-29%, well below the 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_12369
institution	arXiv
publishDate	2026
record_format	arxiv
title	Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies
topic	Computation and Language
url	https://arxiv.org/abs/2601.12369
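
The abstract states that retrieval is scored with Recall/Precision/F1 against the expert-cited papers. The sketch below shows the standard set-based form of those three scores. It is only an illustration: the function name `retrieval_scores`, the use of plain string identifiers, and exact-match comparison are assumptions, and the record does not describe how TaxoBench actually matches retrieved papers to expert citations.

```python
def retrieval_scores(retrieved, expert_cited):
    """Return (precision, recall, f1) for one survey topic.

    Assumption: papers are compared by exact identifier match;
    the benchmark's real matching procedure is not given here.
    """
    retrieved = set(retrieved)
    expert_cited = set(expert_cited)
    hits = len(retrieved & expert_cited)  # papers found in both sets

    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expert_cited) if expert_cited else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical identifiers, for illustration only.
agent_output = ["p1", "p2", "p3", "p4"]
expert_refs = ["p1", "p2", "p7", "p8", "p9"]
precision, recall, f1 = retrieval_scores(agent_output, expert_refs)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this reading, the reported 20.92% for the best agent corresponds to the recall value: the share of expert-cited papers that the agent retrieved.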
