Library Catalog

Bibliographic Details
Main Authors:	Wu, Zhiheng; Wang, Tong; Wang, Shuning; Liu, Naiming; Zhang, Yumeng
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.24339

Similar Items

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
by: Yu, Seonghoon, et al.
Published: (2026)

Think Twice to See More: Iterative Visual Reasoning in Medical VLMs
by: Chen, Kaitao, et al.
Published: (2025)

Watch Wider and Think Deeper: Collaborative Cross-modal Chain-of-Thought for Complex Visual Reasoning
by: Lu, Wenting, et al.
Published: (2026)

Enhancing Advanced Visual Reasoning Ability of Large Language Models
by: Li, Zhiyuan, et al.
Published: (2024)

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
by: Wang, Haozhe, et al.
Published: (2026)

Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
by: Zhan, Yufei, et al.
Published: (2025)

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
by: Liu, Chengzhi, et al.
Published: (2025)

See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent
by: Tang, Tianci, et al.
Published: (2026)

MediSee: Reasoning-based Pixel-level Perception in Medical Images
by: Tong, Qinyue, et al.
Published: (2025)

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
by: Wu, Junfei, et al.
Published: (2025)

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
by: Liu, Tianhui, et al.
Published: (2026)

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
by: Qin, Yiming, et al.
Published: (2025)

MedHorizon: Towards Long-context Medical Video Understanding in the Wild
by: Du, Bodong, et al.
Published: (2026)

Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification
by: Xu, Qin, et al.
Published: (2025)

VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
by: Wang, Zhaozhi, et al.
Published: (2025)

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
by: Gao, Mingjian, et al.
Published: (2026)

TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning
by: Liu, Junhua, et al.
Published: (2026)

Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT
by: Dong, Zhuobai, et al.
Published: (2025)

Belief-Aware VLM Model for Human-like Reasoning
by: Nayak, Anshul, et al.
Published: (2026)

Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning
by: Wang, Wentao, et al.
Published: (2025)

Unlocking Feature Visualization for Deeper Networks with MAgnitude Constrained Optimization
by: Fel, Thomas, et al.
Published: (2023)

Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs
by: Li, Yiwei, et al.
Published: (2026)

CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
by: Duan, Chengqi, et al.
Published: (2025)

Chatting with Images for Introspective Visual Thinking
by: Wu, Junfei, et al.
Published: (2026)

SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
by: Lin, Jiawen, et al.
Published: (2025)

Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
by: Liu, Zhining, et al.
Published: (2025)

Enhancing Spatial Reasoning through Visual and Textual Thinking
by: Liang, Xun, et al.
Published: (2025)

GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
by: Zhan, Yufei, et al.
Published: (2025)

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
by: Jing, Miao, et al.
Published: (2025)

Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts
by: Liu, Xu, et al.
Published: (2026)

Think Visually, Reason Textually: Vision-Language Synergy in ARC
by: Zhang, Beichen, et al.
Published: (2025)

Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
by: Zhang, Jialiang, et al.
Published: (2026)

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
by: Peng, Ruiying, et al.
Published: (2026)

See Further for Parameter Efficient Fine-tuning by Standing on the Shoulders of Decomposition
by: Si, Chongjie, et al.
Published: (2024)

VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models
by: Wu, Kui, et al.
Published: (2025)

Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning
by: Ma, Xueqi, et al.
Published: (2026)

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
by: Wu, Yixuan, et al.
Published: (2024)

InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding
by: Zhang, Huaxiang, et al.
Published: (2024)

Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
by: Xiao, Xin, et al.
Published: (2024)

Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning
by: Pang, Yuqi, et al.
Published: (2025)