Bibliographic Details
Main Authors: Wang, Zexuan, Yang, Chenghao, Que, Yingqi, Yang, Zhenzhu, Yuan, Huaqing, Wang, Yiwen, Jiang, Zhengxuan, Fang, Shengjie, Wu, Zhenhe, Wang, Zhaohui, Yao, Zhixin, Liu, Jiashuo, Ren, Jincheng, Li, Yuzhen, Yang, Yang, Liu, Jiaheng, Yang, Jian, Wang, Zaiyuan, Zhang, Ge, Wen, Zhoufutu, Huang, Wenhao
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2602.08367
author Wang, Zexuan
Yang, Chenghao
Que, Yingqi
Yang, Zhenzhu
Yuan, Huaqing
Wang, Yiwen
Jiang, Zhengxuan
Fang, Shengjie
Wu, Zhenhe
Wang, Zhaohui
Yao, Zhixin
Liu, Jiashuo
Ren, Jincheng
Li, Yuzhen
Yang, Yang
Liu, Jiaheng
Yang, Jian
Wang, Zaiyuan
Zhang, Ge
Wen, Zhoufutu
Huang, Wenhao
contents Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce WorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop WorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67% feasibility in text-only settings, which plummets to 19.33% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.
format Preprint
id arxiv_https___arxiv_org_abs_2602_08367
institution arXiv
publishDate 2026
record_format arxiv
title WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints
topic Computation and Language
url https://arxiv.org/abs/2602.08367