| Main Authors: | Zhou, Guanyu; Yin, Yida; Chai, Wenhao; Tong, Shengbang; Fu, Xingyu; Liu, Zhuang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.09531 |
Similar Items
UEval: A Benchmark for Unified Multimodal Generation
by: Li, Bo, et al.
Published: (2026)
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
by: Shahgir, Haz Sameen, et al.
Published: (2026)
On the Perception Bottleneck of VLMs for Chart Understanding
by: Liu, Junteng, et al.
Published: (2025)
Reinforced Visual Perception with Tools
by: Zhou, Zetong, et al.
Published: (2025)
Understanding and Rectifying Safety Perception Distortion in VLMs
by: Zou, Xiaohan, et al.
Published: (2025)
Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
by: Shakhadri, Syed Abdul Gaffar, et al.
Published: (2025)
Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
by: Jian, Pu, et al.
Published: (2025)
Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG
by: Wang, Wenbin, et al.
Published: (2025)
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
by: Fu, Xingyu, et al.
Published: (2025)
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
by: Shi, Chufan, et al.
Published: (2026)
Teaching Text-to-Image Models to Communicate in Dialog
by: Sun, Xiaowen, et al.
Published: (2023)
Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering
by: Fu, Xingyu, et al.
Published: (2023)
BabyVision: Visual Reasoning Beyond Language
by: Chen, Liang, et al.
Published: (2026)
Understanding Bias in Large-Scale Visual Datasets
by: Zeng, Boya, et al.
Published: (2024)
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
by: Tong, Shengbang, et al.
Published: (2024)
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
by: Chen, Liang, et al.
Published: (2025)
Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
by: Dai, Haocheng, et al.
Published: (2024)
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
by: Liu, Peng, et al.
Published: (2025)
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
by: Nasiriany, Soroush, et al.
Published: (2024)
Are VLMs Really Blind
by: Singh, Ayush, et al.
Published: (2024)
Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging
by: Poggi, Nicolas, et al.
Published: (2025)
SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs
by: Avogaro, Niccolo, et al.
Published: (2026)
Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding
by: Ye, Junyi, et al.
Published: (2024)
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
by: Zhang, Ce, et al.
Published: (2025)
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
by: Hu, Yushi, et al.
Published: (2024)
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)
ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
by: Zhang, Juntian, et al.
Published: (2025)
SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking
by: Li, Sifan, et al.
Published: (2025)
iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs
by: Mayer, Julius, et al.
Published: (2025)
Can VLMs Recall Factual Associations From Visual References?
by: Ashok, Dhananjay, et al.
Published: (2025)
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
by: Wang, Dianyi, et al.
Published: (2025)
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
by: Zhang, Yanzhe, et al.
Published: (2023)
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
by: Kamoi, Ryo, et al.
Published: (2024)
CIVET: Systematic Evaluation of Understanding in VLMs
by: Rizzoli, Massimo, et al.
Published: (2025)
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
by: Jia, Mengzhao, et al.
Published: (2024)
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
by: Zhai, Yuexiang, et al.
Published: (2024)
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
by: Fang, Irving, et al.
Published: (2025)
Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
by: Yeh, Chun-Hsiao, et al.
Published: (2025)
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
by: Ye, Weihao, et al.
Published: (2024)