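| Main Authors: | Liu, Shunyu; Liu, Minghao; Zhou, Huichi; Cui, Zhenyu; Zhou, Yang; Zhou, Yuhao; Gao, Jialiang; Zhou, Heng; Yang, Yunhao; Fan, Wendong; Zhang, Puzhen; Zhang, Ge; Shi, Jiajun; Xuan, Weihao; Huang, Jiaxing; Luo, Shuang; Wu, Fang; Qi, Heli; Zeng, Qingcheng; Wang, Junjie; Feng, Aosong; Lv, Jindi; Jiang, Sicong; Ren, Ziqi; Zhou, Wangchunshu; Yin, Zhenfei; Zhang, Wenlong; Li, Guohao; Yu, Wenhao; Ma, Lei; Bai, Lei; Lin, Qunshu; Song, Mingli; Tao, Dacheng |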
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Human-Computer Interaction |
| Online Access: | https://arxiv.org/abs/2508.04026 |
| _version_ | 1866915822505558016 |
|---|---|
| author | Liu, Shunyu; Liu, Minghao; Zhou, Huichi; Cui, Zhenyu; Zhou, Yang; Zhou, Yuhao; Gao, Jialiang; Zhou, Heng; Yang, Yunhao; Fan, Wendong; Zhang, Puzhen; Zhang, Ge; Shi, Jiajun; Xuan, Weihao; Huang, Jiaxing; Luo, Shuang; Wu, Fang; Qi, Heli; Zeng, Qingcheng; Wang, Junjie; Feng, Aosong; Lv, Jindi; Jiang, Sicong; Ren, Ziqi; Zhou, Wangchunshu; Yin, Zhenfei; Zhang, Wenlong; Li, Guohao; Yu, Wenhao; Ma, Lei; Bai, Lei; Lin, Qunshu; Song, Mingli; Tao, Dacheng |
| contents | Recent advances have showcased the extraordinary capabilities of Large Language Model (LLM) agents in tackling web-based information-seeking tasks. However, existing efforts mainly focus on single-fact retrieval and rely on outcome-only verification, thereby limiting their scalability in realistic knowledge-intensive scenarios that involve long-horizon web tasks requiring large-scale retrieval and synthesis of information from diverse sources. In this work, we introduce VeriWeb, a novel verifiable long-chain web benchmark designed to facilitate the evaluation and development of web agents within realistic web environments. Our benchmark emphasizes two critical dimensions: (1) long-chain complexity, encompassing both breadth- and depth-oriented search tasks to assess how effectively web agents ensure comprehensive information coverage and consistent context tracking in multi-hop reasoning; and (2) subtask-level verifiability, where tasks are decomposed into a sequence of interdependent verifiable subtasks. This structure enables diverse exploration strategies within each subtask, while ensuring that each subtask-level answer remains unchanged and verifiable. The benchmark consists of 302 tasks across five real-world domains, each with a complete trajectory demonstration, annotated by human experts. Extensive experiments on VeriWeb using various agents powered by different foundation models reveal significant performance gaps in handling long-horizon web tasks, highlighting the need for more powerful agentic information-seeking capabilities. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2508_04026 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking |
| topic | Human-Computer Interaction |
| url | https://arxiv.org/abs/2508.04026 |