Bibliographic Details
Main Authors: Wang, Pan; Liu, Yang; Wu, Guile; Corral-Soto, Eduardo R.; Huang, Chengjie; Xu, Binbin; Bai, Dongfeng; Yan, Xu; Ren, Yuan; Chen, Xingxin; Wu, Yizhe; Huang, Tao; Wan, Wenjun; Wu, Xin; Zhou, Pei; Dai, Xuyang; Lv, Kangbo; Zhang, Hongbo; Fried, Yosef; Ye, Aixue; Feng, Bailan; Chen, Zhenyu; Li, Zhen; Chen, Yingcong; Liao, Yiyi; Liu, Bingbing
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2601.00092
author Wang, Pan
Liu, Yang
Wu, Guile
Corral-Soto, Eduardo R.
Huang, Chengjie
Xu, Binbin
Bai, Dongfeng
Yan, Xu
Ren, Yuan
Chen, Xingxin
Wu, Yizhe
Huang, Tao
Wan, Wenjun
Wu, Xin
Zhou, Pei
Dai, Xuyang
Lv, Kangbo
Zhang, Hongbo
Fried, Yosef
Ye, Aixue
Feng, Bailan
Chen, Zhenyu
Li, Zhen
Chen, Yingcong
Liao, Yiyi
Liu, Bingbing
contents 4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.
format Preprint
id arxiv_https___arxiv_org_abs_2601_00092
institution arXiv
publishDate 2025
record_format arxiv
title Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.00092