Bibliographic Details
Main Authors: Zhang, Ming, Zhuang, Jiabao, Jing, Wenqing, Tan, Kexin, Kong, Ziyu, Deng, Jingyi, Shen, Yujiong, Wang, Yuhui, Xiang, Zhenghao, Peng, Qiyuan, Zhao, Yuhang, Luo, Ning, Zheng, Renzhe, Lin, Jiahui, Wu, Mingqi, Ma, Long, Dou, Shihan, Pan, Maxm, Gui, Tao, Zhang, Qi, Huang, Xuanjing
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2601.12369
_version_ 1866910235583578112
author Zhang, Ming
Zhuang, Jiabao
Jing, Wenqing
Tan, Kexin
Kong, Ziyu
Deng, Jingyi
Shen, Yujiong
Wang, Yuhui
Xiang, Zhenghao
Peng, Qiyuan
Zhao, Yuhang
Luo, Ning
Zheng, Renzhe
Lin, Jiahui
Wu, Mingqi
Ma, Long
Dou, Shihan
Pan, Maxm
Gui, Tao
Zhang, Qi
Huang, Xuanjing
contents Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent) groups. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference. On the alignment side, all 12 LLMs converge to Sem-Path 28-29%, well below 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
format Preprint
id arxiv_https___arxiv_org_abs_2601_12369
institution arXiv
publishDate 2026
record_format arxiv
title Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies
topic Computation and Language
url https://arxiv.org/abs/2601.12369
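
The abstract states that retrieval is scored via Recall/Precision/F1 against the expert-cited papers. A minimal sketch of that kind of set-based scoring follows; it is an illustration only, not the TaxoBench implementation. The function name `retrieval_scores`, the use of plain string identifiers for papers, and the exact-match comparison are all assumptions made here.

```python
def retrieval_scores(retrieved, expert_cited):
    """Set-based precision, recall, and F1 for a retrieved paper list.

    Hypothetical helper: papers are compared by exact identifier match,
    which is an assumption and may differ from the benchmark's matching.
    """
    retrieved = set(retrieved)
    expert_cited = set(expert_cited)
    hits = len(retrieved & expert_cited)  # papers found in both sets

    # Guard against empty inputs to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expert_cited) if expert_cited else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up identifiers (not data from the benchmark).
p, r, f = retrieval_scores(["a", "b", "c", "d"], ["a", "b", "e", "f", "g"])
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under this reading, the reported 20.92% figure for the best agent corresponds to recall: the share of expert-cited papers that appear in the agent's retrieved set.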