_version_ 1866915822505558016
author Liu, Shunyu
Liu, Minghao
Zhou, Huichi
Cui, Zhenyu
Zhou, Yang
Zhou, Yuhao
Gao, Jialiang
Zhou, Heng
Yang, Yunhao
Fan, Wendong
Zhang, Puzhen
Zhang, Ge
Shi, Jiajun
Xuan, Weihao
Huang, Jiaxing
Luo, Shuang
Wu, Fang
Qi, Heli
Zeng, Qingcheng
Wang, Junjie
Feng, Aosong
Lv, Jindi
Jiang, Sicong
Ren, Ziqi
Zhou, Wangchunshu
Yin, Zhenfei
Zhang, Wenlong
Li, Guohao
Yu, Wenhao
Ma, Lei
Bai, Lei
Lin, Qunshu
Song, Mingli
Tao, Dacheng
contents Recent advances have showcased the extraordinary capabilities of Large Language Model (LLM) agents in tackling web-based information-seeking tasks. However, existing efforts mainly focus on single-fact retrieval and rely on outcome-only verification, thereby limiting their scalability in realistic knowledge-intensive scenarios that involve long-horizon web tasks requiring large-scale retrieval and synthesis of information from diverse sources. In this work, we introduce VeriWeb, a novel verifiable long-chain web benchmark designed to facilitate the evaluation and development of web agents within realistic web environments. Our benchmark emphasizes two critical dimensions: (1) long-chain complexity, encompassing both breadth- and depth-oriented search tasks to assess how effectively web agents ensure comprehensive information coverage and consistent context tracking in multi-hop reasoning; and (2) subtask-level verifiability, where tasks are decomposed into a sequence of interdependent verifiable subtasks. This structure enables diverse exploration strategies within each subtask, while ensuring that each subtask-level answer remains unchanged and verifiable. The benchmark consists of 302 tasks across five real-world domains, each with a complete trajectory demonstration, annotated by human experts. Extensive experiments on VeriWeb using various agents powered by different foundation models reveal significant performance gaps in handling long-horizon web tasks, highlighting the need for more powerful agentic information-seeking capabilities.
format Preprint
id arxiv_https___arxiv_org_abs_2508_04026
institution arXiv
publishDate 2025
record_format arxiv
title VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking
topic Human-Computer Interaction
url https://arxiv.org/abs/2508.04026