:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Wanyue; Huang, Yibin; Xu, Yangbin; Huang, JingJing; Zhi, Helu; Ren, Shuo; Xu, Wang; Zhang, Jiajun
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.02359

Similar Items

Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
by: Huang, Yibin, et al.
Published: (2025)

World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
by: Zhang, Wanyue, et al.
Published: (2026)

Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis
by: Zhi, Helu, et al.
Published: (2025)

Why MLLMs Struggle to Determine Object Orientations
by: Gopinath, Anju, et al.
Published: (2026)

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
by: Tang, Yolo Y., et al.
Published: (2025)

Spatial Preference Rewarding for MLLMs Spatial Understanding
by: Qiu, Han, et al.
Published: (2025)

MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?
by: Dai, Shiqi, et al.
Published: (2025)

LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos?
by: Yu, Zhuang, et al.
Published: (2026)

Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective
by: Zhao, Qiyan, et al.
Published: (2026)

Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding
by: Liu, Shuo, et al.
Published: (2026)

On the Generalization Capacities of MLLMs for Spatial Intelligence
by: Zhang, Gongjie, et al.
Published: (2026)

Why Do Vision Language Models Struggle To Recognize Human Emotions?
by: Agarwal, Madhav, et al.
Published: (2026)

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
by: Li, Yun, et al.
Published: (2025)

3D Spatial Understanding in MLLMs: Disambiguation and Evaluation
by: Chang, Chun-Peng, et al.
Published: (2024)

Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
by: Huang, Kung-Hsiang, et al.
Published: (2025)

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
by: Li, Chentao, et al.
Published: (2026)

Aesthetic Image Captioning with Saliency Enhanced MLLMs
by: Tao, Yilin, et al.
Published: (2025)

LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks
by: Kong, Fei, et al.
Published: (2025)

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
by: Chen, Zhenghao, et al.
Published: (2026)

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs
by: He, Yuping, et al.
Published: (2025)

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
by: Yu, Tianyu, et al.
Published: (2025)

ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
by: Yang, Liu, et al.
Published: (2025)

Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
by: Zhu, Fangrui, et al.
Published: (2025)

AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
by: Wang, Junyang, et al.
Published: (2023)

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
by: Tang, Yolo Y., et al.
Published: (2024)

FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs
by: Hao, Jing, et al.
Published: (2024)

SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
by: Zhang, Yue, et al.
Published: (2024)

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
by: Ouyang, Kun, et al.
Published: (2025)

VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking
by: Meng, Desen, et al.
Published: (2025)

PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
by: Zhang, Zixin, et al.
Published: (2025)

Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
by: Li, Shuo, et al.
Published: (2025)

RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios
by: Zhang, Jun, et al.
Published: (2025)

Explore How to Inject Beneficial Noise in MLLMs
by: Zhu, Ruishu, et al.
Published: (2025)

Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
by: Du, Hang, et al.
Published: (2024)

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
by: Lin, Weifeng, et al.
Published: (2024)

MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving
by: Zhang, Zhiyuan, et al.
Published: (2025)

3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
by: Huang, Xiaohu, et al.
Published: (2025)

Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs
by: Zhang, Gengyuan, et al.
Published: (2025)

Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
by: Zhu, Rui, et al.
Published: (2026)

Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
by: Zhang, Xu, et al.
Published: (2026)