Bibliographic Details
Main Authors: Xing, Shuo, Sun, Zezhou, Xie, Shuangyu, Chen, Kaiyuan, Huang, Yanjia, Wang, Yuping, Li, Jiachen, Song, Dezhen, Tu, Zhengzhong
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2503.14607
author Xing, Shuo
Sun, Zezhou
Xie, Shuangyu
Chen, Kaiyuan
Huang, Yanjia
Wang, Yuping
Li, Jiachen
Song, Dezhen
Tu, Zhengzhong
contents In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based, map-based outdoor navigation, curated from complex path-finding scenarios. MapBench comprises over 1600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure used to convert to and from natural language and to evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release the code and dataset at https://github.com/taco-group/MapBench.
format Preprint
id arxiv_https___arxiv_org_abs_2503_14607
institution arXiv
publishDate 2025
record_format arxiv
title Can Large Vision Language Models Read Maps Like a Human?
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2503.14607