| Main Authors: | Lui, Hin Wai, Krichmar, Jeffrey L. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2503.07939 |
Similar Items
Learning Vision from Models Rivals Learning Vision from Data
by: Tian, Yonglong, et al.
Published: (2023)
SpatialBot: Precise Spatial Understanding with Vision Language Models
by: Cai, Wenxiao, et al.
Published: (2024)
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
by: Wu, Xiyang, et al.
Published: (2025)
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
by: Cheng, An-Chieh, et al.
Published: (2024)
Temporal-Spatial Object Relations Modeling for Vision-and-Language Navigation
by: Huang, Bowen, et al.
Published: (2024)
PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks
by: Cui, Cheng, et al.
Published: (2026)
Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models
by: Souza, Rafael, et al.
Published: (2024)
InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
by: Deng, Nianchen, et al.
Published: (2025)
Vision-Language Memory for Spatial Reasoning
by: Liu, Zuntao, et al.
Published: (2025)
Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models
by: Huang, Xinmiao, et al.
Published: (2025)
Masked Diffusion Vision-Language Models for Temporal Action Localization
by: Wang, Fengshun, et al.
Published: (2026)
STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models
by: Liang, Yiming, et al.
Published: (2026)
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
by: Li, Yun, et al.
Published: (2025)
GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
by: Xie, Qinghongbing, et al.
Published: (2025)
CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning
by: Chen, Zeyuan, et al.
Published: (2025)
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
by: Stogiannidis, Ilias, et al.
Published: (2025)
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
by: Xu, Huilin, et al.
Published: (2025)
Precise GPS-Denied UAV Self-Positioning via Context-Enhanced Cross-View Geo-Localization
by: Xu, Yuanze, et al.
Published: (2025)
Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models
by: Zhang, Jiahuan, et al.
Published: (2025)
Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
by: Gholami, Mohsen, et al.
Published: (2025)
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
by: Zhou, Shengchao, et al.
Published: (2025)
SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models
by: Guo, Xianda, et al.
Published: (2024)
The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
by: Cui, Kelly, et al.
Published: (2026)
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
by: Agrawal, Palaash, et al.
Published: (2023)
Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
by: Xu, Ming, et al.
Published: (2024)
Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
by: Li, Jiaqi, et al.
Published: (2026)
Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning
by: Tang, Yihong, et al.
Published: (2024)
SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
by: Li, Hongxing, et al.
Published: (2025)
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
by: Liao, Yuan-Hong, et al.
Published: (2024)
On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
by: Ranjan, Mukul, et al.
Published: (2026)
Altitude-Adaptive Vision-Only Geo-Localization for UAVs in GPS-Denied Environments
by: Shao, Xingyu, et al.
Published: (2026)
ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
by: Ko, Dohwan, et al.
Published: (2025)
DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning
by: Yuan, Tianyuan, et al.
Published: (2025)
Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
by: Feng, Zhiyuan, et al.
Published: (2025)
Temporal-Guided Visual Foundation Models for Event-Based Vision
by: Xia, Ruihao, et al.
Published: (2025)
SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
by: Ma, Chenyang, et al.
Published: (2024)
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)
Progressive Cross-Stream Cooperation in Spatial and Temporal Domain for Action Localization
by: Su, Rui, et al.
Published: (2019)
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
by: Liu, Benlin, et al.
Published: (2024)
MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model
by: Lee, Youngwan, et al.
Published: (2026)