| Main Authors: | Zhu, Rui; Shen, Xin; Wu, Shuchen; Miao, Chenxi; Yu, Xin; Li, Yang; Li, Weikang; Xia, Deguo; Huang, Jizhou |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online access: | https://arxiv.org/abs/2601.09430 |
| _version_ | 1866915729683513344 |
|---|---|
| author | Zhu, Rui; Shen, Xin; Wu, Shuchen; Miao, Chenxi; Yu, Xin; Li, Yang; Li, Weikang; Xia, Deguo; Huang, Jizhou |
| author_facet | Zhu, Rui; Shen, Xin; Wu, Shuchen; Miao, Chenxi; Yu, Xin; Li, Yang; Li, Weikang; Xia, Deguo; Huang, Jizhou |
| contents | Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), drawing increasing attention and rapid advancement. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios requiring complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline combining advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations, revealing that while models demonstrate proficiency in surface-level perception, they exhibit distinct performance drops in MSR tasks, frequently suffering from spatial disorientation and hallucination during multi-step deductions. To mitigate these shortcomings and empower models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at https://github.com/ruiz-nju/Video-MSR. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_09430 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2601.09430 |