:: Library Catalog

Bibliographic Details
Main Authors:	Zhou, Guanyu; Yin, Yida; Chai, Wenhao; Tong, Shengbang; Fu, Xingyu; Liu, Zhuang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2604.09531

Similar Items

UEval: A Benchmark for Unified Multimodal Generation
by: Li, Bo, et al.
Published: (2026)

VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
by: Shahgir, Haz Sameen, et al.
Published: (2026)

On the Perception Bottleneck of VLMs for Chart Understanding
by: Liu, Junteng, et al.
Published: (2025)

Reinforced Visual Perception with Tools
by: Zhou, Zetong, et al.
Published: (2025)

Understanding and Rectifying Safety Perception Distortion in VLMs
by: Zou, Xiaohan, et al.
Published: (2025)

Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
by: Shakhadri, Syed Abdul Gaffar, et al.
Published: (2025)

Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
by: Jian, Pu, et al.
Published: (2025)

Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG
by: Wang, Wenbin, et al.
Published: (2025)

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
by: Fu, Xingyu, et al.
Published: (2025)

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
by: Shi, Chufan, et al.
Published: (2026)

Teaching Text-to-Image Models to Communicate in Dialog
by: Sun, Xiaowen, et al.
Published: (2023)

Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering
by: Fu, Xingyu, et al.
Published: (2023)

BabyVision: Visual Reasoning Beyond Language
by: Chen, Liang, et al.
Published: (2026)

Understanding Bias in Large-Scale Visual Datasets
by: Zeng, Boya, et al.
Published: (2024)

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
by: Tong, Shengbang, et al.
Published: (2024)

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
by: Chen, Liang, et al.
Published: (2025)

Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
by: Dai, Haocheng, et al.
Published: (2024)

VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
by: Liu, Peng, et al.
Published: (2025)

PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
by: Nasiriany, Soroush, et al.
Published: (2024)

Are VLMs Really Blind
by: Singh, Ayush, et al.
Published: (2024)

Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging
by: Poggi, Nicolas, et al.
Published: (2025)

SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs
by: Avogaro, Niccolo, et al.
Published: (2026)

Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding
by: Ye, Junyi, et al.
Published: (2024)

VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
by: Zhang, Ce, et al.
Published: (2025)

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
by: Hu, Yushi, et al.
Published: (2024)

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)

ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
by: Zhang, Juntian, et al.
Published: (2025)

SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking
by: Li, Sifan, et al.
Published: (2025)

iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs
by: Mayer, Julius, et al.
Published: (2025)

Can VLMs Recall Factual Associations From Visual References?
by: Ashok, Dhananjay, et al.
Published: (2025)

Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
by: Wang, Dianyi, et al.
Published: (2025)

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
by: Zhang, Yanzhe, et al.
Published: (2023)

VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
by: Kamoi, Ryo, et al.
Published: (2024)

CIVET: Systematic Evaluation of Understanding in VLMs
by: Rizzoli, Massimo, et al.
Published: (2025)

Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
by: Jia, Mengzhao, et al.
Published: (2024)

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
by: Zhai, Yuexiang, et al.
Published: (2024)

From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
by: Fang, Irving, et al.
Published: (2025)

Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
by: Yeh, Chun-Hsiao, et al.
Published: (2025)

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
by: Ye, Weihao, et al.
Published: (2024)