Bibliographic Details
Main Authors:	Li, Yun; Zhang, Yiming; Lin, Tao; Liu, Xiangrui; Cai, Wenxiao; Liu, Zheng; Zhao, Bo
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.23765

_version_	1866913945394085888
author	Li, Yun; Zhang, Yiming; Lin, Tao; Liu, Xiangrui; Cai, Wenxiao; Liu, Zheng; Zhao, Bo
author_facet	Li, Yun; Zhang, Yiming; Lin, Tao; Liu, Xiangrui; Cai, Wenxiao; Liu, Zheng; Zhao, Bo
contents	The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leading to uncertain prospects. To evaluate models' Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle in real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_23765
institution	arXiv
publishDate	2025
record_format	arxiv
title	STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.23765
