:: Library Catalog

Bibliographic Details
Main Authors:	Zou, Xiaohan; Kang, Jian; Kesidis, George; Lin, Lu
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2502.13095

Similar Items

TechING: Towards Real World Technical Image Understanding via VLMs
by: Nadeem, Tafazzul, et al.
Published: (2026)

VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
by: Chen, Menglan, et al.
Published: (2025)

ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
by: Liang, Yijun, et al.
Published: (2025)

Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs
by: Li, Shuo, et al.
Published: (2024)

On the Perception Bottleneck of VLMs for Chart Understanding
by: Liu, Junteng, et al.
Published: (2025)

Leveraging NTPs for Efficient Hallucination Detection in VLMs
by: Azachi, Ofir, et al.
Published: (2025)

Sensitivity of Generative VLMs to Semantically and Lexically Altered Prompts
by: Dumpala, Sri Harsha, et al.
Published: (2024)

Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
by: Xu, Xiao, et al.
Published: (2025)

Bidirectional Long-Range Parser for Sequential Data Understanding
by: Leotescu, George, et al.
Published: (2024)

Robustness of Structured Data Extraction from Perspectively Distorted Documents
by: Nakada, Hyakka, et al.
Published: (2025)

Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?
by: Park, Simon, et al.
Published: (2025)

PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
by: Nasiriany, Soroush, et al.
Published: (2024)

PerLA: Perceptive 3D Language Assistant
by: Mei, Guofeng, et al.
Published: (2024)

Can World Models Benefit VLMs for World Dynamics?
by: Zhang, Kevin, et al.
Published: (2025)

Unraveling the Truth: Do VLMs really Understand Charts? A Deep Dive into Consistency and Robustness
by: Mukhopadhyay, Srija, et al.
Published: (2024)

Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs
by: Pan, Zhiyu, et al.
Published: (2026)

VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
by: Palaskar, Shruti, et al.
Published: (2025)

Temporal Preference Optimization for Long-Form Video Understanding
by: Li, Rui, et al.
Published: (2025)

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
by: Wang, Hao, et al.
Published: (2026)

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
by: Karamcheti, Siddharth, et al.
Published: (2024)

Fine-tuning MLLMs Without Forgetting Is Easier Than You Think
by: Li, He, et al.
Published: (2026)

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
by: Daxberger, Erik, et al.
Published: (2025)

Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models
by: Li, Yue, et al.
Published: (2025)

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
by: Wang, Xiyao, et al.
Published: (2025)

Toward Inherently Robust VLMs Against Visual Perception Attacks
by: MohajerAnsari, Pedram, et al.
Published: (2025)

Towards Efficient Vision-Language Tuning: More Information Density, More Generalizability
by: Hao, Tianxiang, et al.
Published: (2023)

Do VLMs Have a Moral Backbone? A Study on the Fragile Morality of Vision-Language Models
by: Liu, Zhining, et al.
Published: (2026)

DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry
by: Wu, Changti, et al.
Published: (2025)

Evaluating and Advancing Multimodal Large Language Models in Perception Ability Lens
by: Chen, Feng, et al.
Published: (2024)

Improving Language Understanding from Screenshots
by: Gao, Tianyu, et al.
Published: (2024)

LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
by: Liao, Yuan-Hong, et al.
Published: (2025)

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
by: Ding, Yi, et al.
Published: (2024)

VisMin: Visual Minimal-Change Understanding
by: Awal, Rabiul, et al.
Published: (2024)

CIVET: Systematic Evaluation of Understanding in VLMs
by: Rizzoli, Massimo, et al.
Published: (2025)

Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
by: Wang, Xiao, et al.
Published: (2024)

MLLM-as-a-Judge for Image Safety without Human Labeling
by: Wang, Zhenting, et al.
Published: (2024)

Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark
by: Heyward, Joseph, et al.
Published: (2024)

Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement
by: Yi, Zhihang, et al.
Published: (2026)

DocAtlas: Multilingual Document Understanding Across 80+ Languages
by: Heakl, Ahmed, et al.
Published: (2026)

Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
by: Liu, Xiangyue, et al.
Published: (2026)