Staff View: SITE: towards Spatial Intelligence Thorough Evaluation :: Library Catalog

Bibliographic Details
Main authors:	Wang, Wenqi; Tan, Reuben; Zhu, Pengyue; Yang, Jianwei; Yang, Zhengyuan; Wang, Lijuan; Kolobov, Andrey; Gao, Jianfeng; Gong, Boqing
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online access:	https://arxiv.org/abs/2505.05456

_version_	1866917080449679360
author	Wang, Wenqi; Tan, Reuben; Zhu, Pengyue; Yang, Jianwei; Yang, Zhengyuan; Wang, Lijuan; Kolobov, Andrey; Gao, Jianfeng; Gong, Boqing
author_facet	Wang, Wenqi; Tan, Reuben; Zhu, Pengyue; Yang, Jianwei; Yang, Zhengyuan; Wang, Lijuan; Kolobov, Andrey; Gao, Jianfeng; Gong, Boqing
contents	Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey about 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_05456
institution	arXiv
publishDate	2025
record_format	arxiv
title	SITE: towards Spatial Intelligence Thorough Evaluation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2505.05456

Similar Items