:: Library Catalog

Bibliographic Details
Main Authors:	Sharma, Sourabh; Gupta, Sonam; Sadbhawna
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2512.02456

Similar Items

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
by: Liu, Chengzhi, et al.
Published: (2025)

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts
by: Xu, Haolei, et al.
Published: (2026)

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
by: Li, Yunxin, et al.
Published: (2025)

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
by: Wu, Juncheng, et al.
Published: (2026)

ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding
by: Yang, Jianjiang, et al.
Published: (2025)

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
by: Yu, Seonghoon, et al.
Published: (2026)

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
by: Tong, Jingqi, et al.
Published: (2025)

From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
by: Zhu, Wenxin, et al.
Published: (2025)

Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
by: Liang, Hao, et al.
Published: (2025)

Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
by: Caffagni, Davide, et al.
Published: (2025)

Knowledge-Aware Reasoning over Multimodal Semi-structured Tables
by: Mathur, Suyash Vardhan, et al.
Published: (2024)

See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation
by: Rizwan, Naquee, et al.
Published: (2026)

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
by: Yuan, Qianhao, et al.
Published: (2026)

Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era
by: Oneata, Dan, et al.
Published: (2025)

ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning
by: Tao, Xingjian, et al.
Published: (2026)

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
by: Zhang, Longxiang, et al.
Published: (2026)

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
by: Chung, Jiwan, et al.
Published: (2025)

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
by: Yang, Jihan, et al.
Published: (2024)

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
by: Hu, Juncheng, et al.
Published: (2026)

Seeing Culture: A Benchmark for Visual Reasoning and Grounding
by: Satar, Burak, et al.
Published: (2025)

BLINK: Multimodal Large Language Models Can See but Not Perceive
by: Fu, Xingyu, et al.
Published: (2024)

Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
by: Lai, Zhengzhao, et al.
Published: (2025)

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
by: Sun, Kaiser, et al.
Published: (2026)

Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation
by: Martin, Alexander, et al.
Published: (2025)

LaRe: Latent Refocusing for Multimodal Reasoning
by: Ma, Jizheng, et al.
Published: (2025)

Reinforcing Multimodal Reasoning Against Visual Degradation
by: Liu, Rui, et al.
Published: (2026)

Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs
by: Wang, Wenxuan, et al.
Published: (2025)

Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
by: Wu, Jiaying, et al.
Published: (2025)

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
by: Chen, Liang, et al.
Published: (2025)

Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
by: Wang, Lu, et al.
Published: (2026)

From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning
by: Sharif, Omar, et al.
Published: (2026)

One RL to See Them All: Visual Triple Unified Reinforcement Learning
by: Ma, Yan, et al.
Published: (2025)

Diving into Self-Evolving Training for Multimodal Reasoning
by: Liu, Wei, et al.
Published: (2024)

Unleashing Perception-Time Scaling to Multimodal Reasoning Models
by: Li, Yifan, et al.
Published: (2025)

Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection
by: Yang, Ruichao, et al.
Published: (2026)

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
by: Hua, Jiacheng, et al.
Published: (2026)

MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
by: Gan, Ziliang, et al.
Published: (2024)

Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
by: Zhong, Qihuang, et al.
Published: (2026)

C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
by: Chen, Xiuwei, et al.
Published: (2025)

Thinking with Programming Vision: Towards a Unified View for Thinking with Images
by: Guo, Zirun, et al.
Published: (2025)