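| Main Authors: | Xing, Shuo; Sun, Zezhou; Xie, Shuangyu; Chen, Kaiyuan; Huang, Yanjia; Wang, Yuping; Li, Jiachen; Song, Dezhen; Tu, Zhengzhong |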
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2503.14607 |
| _version_ | 1866916656590094336 |
|---|---|
| author | Xing, Shuo; Sun, Zezhou; Xie, Shuangyu; Chen, Kaiyuan; Huang, Yanjia; Wang, Yuping; Li, Jiachen; Song, Dezhen; Tu, Zhengzhong |
| author_facet | Xing, Shuo; Sun, Zezhou; Xie, Shuangyu; Chen, Kaiyuan; Huang, Yanjia; Wang, Yuping; Li, Jiachen; Song, Dezhen; Tu, Zhengzhong |
| contents | In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based, map-based outdoor navigation, curated from complex path-finding scenarios. MapBench comprises over 1,600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure to convert between natural language and evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all code and the dataset at https://github.com/taco-group/MapBench. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2503_14607 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Can Large Vision Language Models Read Maps Like a Human? |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2503.14607 |