Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Ko, Dohwan, Kim, Sihyeon, Suh, Yumin, G, Vijay Kumar B., Yoon, Minseo, Chandraker, Manmohan, Kim, Hyunwoo J.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.19355
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866912294511837184
author	Ko, Dohwan Kim, Sihyeon Suh, Yumin G, Vijay Kumar B. Yoon, Minseo Chandraker, Manmohan Kim, Hyunwoo J.
author_facet	Ko, Dohwan Kim, Sihyeon Suh, Yumin G, Vijay Kumar B. Yoon, Minseo Chandraker, Manmohan Kim, Hyunwoo J.
contents	Spatio-temporal reasoning is essential in understanding real-world environments in various fields, eg, autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction in real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (eg, ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_19355
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models Ko, Dohwan Kim, Sihyeon Suh, Yumin G, Vijay Kumar B. Yoon, Minseo Chandraker, Manmohan Kim, Hyunwoo J. Computer Vision and Pattern Recognition Spatio-temporal reasoning is essential in understanding real-world environments in various fields, eg, autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction in real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (eg, ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.
title	ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.19355

Similar Items