Staff View: VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Shunyu; Liu, Minghao; Zhou, Huichi; Cui, Zhenyu; Zhou, Yang; Zhou, Yuhao; Gao, Jialiang; Zhou, Heng; Yang, Yunhao; Fan, Wendong; Zhang, Puzhen; Zhang, Ge; Shi, Jiajun; Xuan, Weihao; Huang, Jiaxing; Luo, Shuang; Wu, Fang; Qi, Heli; Zeng, Qingcheng; Wang, Junjie; Feng, Aosong; Lv, Jindi; Jiang, Sicong; Ren, Ziqi; Zhou, Wangchunshu; Yin, Zhenfei; Zhang, Wenlong; Li, Guohao; Yu, Wenhao; Ma, Lei; Bai, Lei; Lin, Qunshu; Song, Mingli; Tao, Dacheng
Format:	Preprint
Published:	2025
Subjects:	Human-Computer Interaction
Online Access:	https://arxiv.org/abs/2508.04026

_version_	1866915822505558016
author	Liu, Shunyu; Liu, Minghao; Zhou, Huichi; Cui, Zhenyu; Zhou, Yang; Zhou, Yuhao; Gao, Jialiang; Zhou, Heng; Yang, Yunhao; Fan, Wendong; Zhang, Puzhen; Zhang, Ge; Shi, Jiajun; Xuan, Weihao; Huang, Jiaxing; Luo, Shuang; Wu, Fang; Qi, Heli; Zeng, Qingcheng; Wang, Junjie; Feng, Aosong; Lv, Jindi; Jiang, Sicong; Ren, Ziqi; Zhou, Wangchunshu; Yin, Zhenfei; Zhang, Wenlong; Li, Guohao; Yu, Wenhao; Ma, Lei; Bai, Lei; Lin, Qunshu; Song, Mingli; Tao, Dacheng
author_facet	Liu, Shunyu; Liu, Minghao; Zhou, Huichi; Cui, Zhenyu; Zhou, Yang; Zhou, Yuhao; Gao, Jialiang; Zhou, Heng; Yang, Yunhao; Fan, Wendong; Zhang, Puzhen; Zhang, Ge; Shi, Jiajun; Xuan, Weihao; Huang, Jiaxing; Luo, Shuang; Wu, Fang; Qi, Heli; Zeng, Qingcheng; Wang, Junjie; Feng, Aosong; Lv, Jindi; Jiang, Sicong; Ren, Ziqi; Zhou, Wangchunshu; Yin, Zhenfei; Zhang, Wenlong; Li, Guohao; Yu, Wenhao; Ma, Lei; Bai, Lei; Lin, Qunshu; Song, Mingli; Tao, Dacheng
contents	Recent advances have showcased the extraordinary capabilities of Large Language Model (LLM) agents in tackling web-based information-seeking tasks. However, existing efforts mainly focus on single-fact retrieval and rely on outcome-only verification, thereby limiting their scalability in realistic knowledge-intensive scenarios that involve long-horizon web tasks requiring large-scale retrieval and synthesis of information from diverse sources. In this work, we introduce VeriWeb, a novel verifiable long-chain web benchmark designed to facilitate the evaluation and development of web agents within realistic web environments. Our benchmark emphasizes two critical dimensions: (1) long-chain complexity, encompassing both breadth- and depth-oriented search tasks to assess how effectively web agents ensure comprehensive information coverage and consistent context tracking in multi-hop reasoning; and (2) subtask-level verifiability, where tasks are decomposed into a sequence of interdependent verifiable subtasks. This structure enables diverse exploration strategies within each subtask, while ensuring that each subtask-level answer remains unchanged and verifiable. The benchmark consists of 302 tasks across five real-world domains, each with a complete trajectory demonstration, annotated by human experts. Extensive experiments on VeriWeb using various agents powered by different foundation models reveal significant performance gaps in handling long-horizon web tasks, highlighting the need for more powerful agentic information-seeking capabilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_04026
institution	arXiv
publishDate	2025
record_format	arxiv
title	VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking
topic	Human-Computer Interaction
url	https://arxiv.org/abs/2508.04026

Similar Items