Bibliographic details
Main Authors: Zhu, Rui, Shen, Xin, Wu, Shuchen, Miao, Chenxi, Yu, Xin, Li, Yang, Li, Weikang, Xia, Deguo, Huang, Jizhou
Format: Preprint
Published: 2026
Subjects:
Online access: https://arxiv.org/abs/2601.09430
author Zhu, Rui
Shen, Xin
Wu, Shuchen
Miao, Chenxi
Yu, Xin
Li, Yang
Li, Weikang
Xia, Deguo
Huang, Jizhou
contents Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), drawing increasing attention and rapid advancement. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios requiring complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline combining advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations, revealing that while models demonstrate proficiency in surface-level perception, they exhibit distinct performance drops in MSR tasks, frequently suffering from spatial disorientation and hallucination during multi-step deductions. To mitigate these shortcomings and empower models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at https://github.com/ruiz-nju/Video-MSR.
format Preprint
id arxiv_https___arxiv_org_abs_2601_09430
institution arXiv
publishDate 2026
record_format arxiv
title Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.09430