MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Jingping; Liu, Ziyan; Cen, Zhedong; Zhou, Yan; Zou, Yinan; Zhang, Weiyan; Jiang, Haiyun; Ruan, Tong
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2505.19015

_version_	1866916886143303680
author	Liu, Jingping; Liu, Ziyan; Cen, Zhedong; Zhou, Yan; Zou, Yinan; Zhang, Weiyan; Jiang, Haiyun; Ruan, Tong
author_facet	Liu, Jingping; Liu, Ziyan; Cen, Zhedong; Zhou, Yan; Zou, Yinan; Zhang, Weiyan; Jiang, Haiyun; Ruan, Tong
contents	Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues such as relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented, and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting future research directions. The benchmark and code are available at https://github.com/ziyan-xiaoyu/SpatialMQA.git.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_19015
institution	arXiv
publishDate	2025
record_format	arxiv
title	Can Multimodal Large Language Models Understand Spatial Relations?
topic	Computer Vision and Pattern Recognition; Multimedia
url	https://arxiv.org/abs/2505.19015
