Bibliographic Details
Main Authors:	Lui, Hin Wai; Krichmar, Jeffrey L.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.07939

Similar Items

Learning Vision from Models Rivals Learning Vision from Data
by: Tian, Yonglong, et al.
Published: (2023)

SpatialBot: Precise Spatial Understanding with Vision Language Models
by: Cai, Wenxiao, et al.
Published: (2024)

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
by: Wu, Xiyang, et al.
Published: (2025)

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
by: Cheng, An-Chieh, et al.
Published: (2024)

Temporal-Spatial Object Relations Modeling for Vision-and-Language Navigation
by: Huang, Bowen, et al.
Published: (2024)

PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks
by: Cui, Cheng, et al.
Published: (2026)

Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models
by: Souza, Rafael, et al.
Published: (2024)

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
by: Deng, Nianchen, et al.
Published: (2025)

Vision-Language Memory for Spatial Reasoning
by: Liu, Zuntao, et al.
Published: (2025)

Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models
by: Huang, Xinmiao, et al.
Published: (2025)

Masked Diffusion Vision-Language Models for Temporal Action Localization
by: Wang, Fengshun, et al.
Published: (2026)

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models
by: Liang, Yiming, et al.
Published: (2026)

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
by: Li, Yun, et al.
Published: (2025)

GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
by: Xie, Qinghongbing, et al.
Published: (2025)

CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning
by: Chen, Zeyuan, et al.
Published: (2025)

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
by: Stogiannidis, Ilias, et al.
Published: (2025)

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
by: Xu, Huilin, et al.
Published: (2025)

Precise GPS-Denied UAV Self-Positioning via Context-Enhanced Cross-View Geo-Localization
by: Xu, Yuanze, et al.
Published: (2025)

Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models
by: Zhang, Jiahuan, et al.
Published: (2025)

Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
by: Gholami, Mohsen, et al.
Published: (2025)

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
by: Zhou, Shengchao, et al.
Published: (2025)

SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models
by: Guo, Xianda, et al.
Published: (2024)

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
by: Cui, Kelly, et al.
Published: (2026)

STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
by: Agrawal, Palaash, et al.
Published: (2023)

Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
by: Xu, Ming, et al.
Published: (2024)

Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
by: Li, Jiaqi, et al.
Published: (2026)

Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning
by: Tang, Yihong, et al.
Published: (2024)

SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
by: Li, Hongxing, et al.
Published: (2025)

Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
by: Liao, Yuan-Hong, et al.
Published: (2024)

On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
by: Ranjan, Mukul, et al.
Published: (2026)

Altitude-Adaptive Vision-Only Geo-Localization for UAVs in GPS-Denied Environments
by: Shao, Xingyu, et al.
Published: (2026)

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
by: Ko, Dohwan, et al.
Published: (2025)

DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning
by: Yuan, Tianyuan, et al.
Published: (2025)

Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
by: Feng, Zhiyuan, et al.
Published: (2025)

Temporal-Guided Visual Foundation Models for Event-Based Vision
by: Xia, Ruihao, et al.
Published: (2025)

SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
by: Ma, Chenyang, et al.
Published: (2024)

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)

Progressive Cross-Stream Cooperation in Spatial and Temporal Domain for Action Localization
by: Su, Rui, et al.
Published: (2019)

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
by: Liu, Benlin, et al.
Published: (2024)

MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model
by: Lee, Youngwan, et al.
Published: (2026)