| Main Authors: | Zhang, Wanyue, Huang, Yibin, Xu, Yangbin, Huang, JingJing, Zhi, Helu, Ren, Shuo, Xu, Wang, Zhang, Jiajun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.02359 |
Similar Items
Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
by: Huang, Yibin, et al.
Published: (2025)
World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
by: Zhang, Wanyue, et al.
Published: (2026)
Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis
by: Helu, Zhi, et al.
Published: (2025)
Why MLLMs Struggle to Determine Object Orientations
by: Gopinath, Anju, et al.
Published: (2026)
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
by: Tang, Yolo Y., et al.
Published: (2025)
Spatial Preference Rewarding for MLLMs Spatial Understanding
by: Qiu, Han, et al.
Published: (2025)
MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?
by: Dai, Shiqi, et al.
Published: (2025)
LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos?
by: Yu, Zhuang, et al.
Published: (2026)
Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective
by: Zhao, Qiyan, et al.
Published: (2026)
Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding
by: Liu, Shuo, et al.
Published: (2026)
On the Generalization Capacities of MLLMs for Spatial Intelligence
by: Zhang, Gongjie, et al.
Published: (2026)
Why Do Vision Language Models Struggle To Recognize Human Emotions?
by: Agarwal, Madhav, et al.
Published: (2026)
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
by: Li, Yun, et al.
Published: (2025)
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation
by: Chang, Chun-Peng, et al.
Published: (2024)
Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
by: Huang, Kung-Hsiang, et al.
Published: (2025)
Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
by: Li, Chentao, et al.
Published: (2026)
Aesthetic Image Captioning with Saliency Enhanced MLLMs
by: Tao, Yilin, et al.
Published: (2025)
LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks
by: Kong, Fei, et al.
Published: (2025)
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
by: Chen, Zhenghao, et al.
Published: (2026)
EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs
by: He, Yuping, et al.
Published: (2025)
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
by: Yu, Tianyu, et al.
Published: (2025)
ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
by: Yang, Liu, et al.
Published: (2025)
Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
by: Zhu, Fangrui, et al.
Published: (2025)
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
by: Wang, Junyang, et al.
Published: (2023)
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
by: Tang, Yolo Y., et al.
Published: (2024)
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs
by: Hao, Jing, et al.
Published: (2024)
SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
by: Zhang, Yue, et al.
Published: (2024)
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
by: Ouyang, Kun, et al.
Published: (2025)
VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking
by: Meng, Desen, et al.
Published: (2025)
PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
by: Zhang, Zixin, et al.
Published: (2025)
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
by: Li, Shuo, et al.
Published: (2025)
RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios
by: Zhang, Jun, et al.
Published: (2025)
Explore How to Inject Beneficial Noise in MLLMs
by: Zhu, Ruishu, et al.
Published: (2025)
Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
by: Du, Hang, et al.
Published: (2024)
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
by: Lin, Weifeng, et al.
Published: (2024)
MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving
by: Zhang, Zhiyuan, et al.
Published: (2025)
3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
by: Huang, Xiaohu, et al.
Published: (2025)
Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs
by: Zhang, Gengyuan, et al.
Published: (2025)
Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
by: Zhu, Rui, et al.
Published: (2026)
Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
by: Zhang, Xu, et al.
Published: (2026)