Staff View :: Library Catalog


Bibliographic Details
Main Authors:	Zhu, Rui; Shen, Xin; Wu, Shuchen; Miao, Chenxi; Yu, Xin; Li, Yang; Li, Weikang; Xia, Deguo; Huang, Jizhou
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.09430

_version_	1866915729683513344
author	Zhu, Rui; Shen, Xin; Wu, Shuchen; Miao, Chenxi; Yu, Xin; Li, Yang; Li, Weikang; Xia, Deguo; Huang, Jizhou
author_facet	Zhu, Rui; Shen, Xin; Wu, Shuchen; Miao, Chenxi; Yu, Xin; Li, Yang; Li, Weikang; Xia, Deguo; Huang, Jizhou
contents	Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), drawing increasing attention and rapid advancement. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios requiring complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline combining advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations, revealing that while models demonstrate proficiency in surface-level perception, they exhibit distinct performance drops in MSR tasks, frequently suffering from spatial disorientation and hallucination during multi-step deductions. To mitigate these shortcomings and empower models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at https://github.com/ruiz-nju/Video-MSR.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_09430
institution	arXiv
publishDate	2026
record_format	arxiv
title	Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.09430
