
Bibliographic Details
Main Authors:	Basappa, Aahana; Goel, Pranay; Karra, Anusri; Karra, Anish; Gilmore, Asa; Zhu, Kevin
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.17037

Similar Items

AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors
by: Zhang, Junyang, et al.
Published: (2025)

Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption
by: Erol, Mehmet Kaan
Published: (2026)

Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables
by: Singh, Anshul, et al.
Published: (2025)

VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
by: Hong, Hyesoo, et al.
Published: (2026)

PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
by: Mak, Chak-Wing, et al.
Published: (2026)

The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
by: Azad, Asif, et al.
Published: (2025)

Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting
by: Guo, Xuyang, et al.
Published: (2025)

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
by: Tan, Zhangyun, et al.
Published: (2026)

CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
by: Jian, Ai, et al.
Published: (2025)

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
by: Wang, Xingrui, et al.
Published: (2025)

CLASH: A Benchmark for Cross-Modal Contradiction Detection
by: Popordanoska, Teodora, et al.
Published: (2025)

CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems
by: Tian, Yonglin, et al.
Published: (2026)

Discovering Failure Modes in Vision-Language Models using RL
by: Jain, Kanishk, et al.
Published: (2026)

Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving
by: Tang, Zecong, et al.
Published: (2026)

Multi-Prompt with Depth Partitioned Cross-Modal Learning
by: Tian, Yingjie, et al.
Published: (2023)

Can Generalist Vision Language Models (VLMs) Rival Specialist Medical VLMs? Benchmarking and Strategic Insights
by: Zhong, Yuan, et al.
Published: (2025)

iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs
by: Mayer, Julius, et al.
Published: (2025)

IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
by: Faraz, Ali, et al.
Published: (2025)

VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs
by: Berman, Shmuel, et al.
Published: (2025)

Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs
by: Ballout, Mohamad, et al.
Published: (2025)

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
by: Oh, Youngtaek, et al.
Published: (2024)

CrossModalityDiffusion: Multi-Modal Novel View Synthesis with Unified Intermediate Representation
by: Berian, Alex, et al.
Published: (2025)

Enhancing Multimodal Unified Representations for Cross Modal Generalization
by: Huang, Hai, et al.
Published: (2024)

BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs
by: Singha, Mainak, et al.
Published: (2026)

A Study of Failure Modes in Two-Stage Human-Object Interaction Detection
by: Wang, Lemeng, et al.
Published: (2026)

FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection
by: Bhaskar, Paramananda, et al.
Published: (2026)

Source-Free Cross-Modal Knowledge Transfer by Unleashing the Potential of Task-Irrelevant Data
by: Zhu, Jinjing, et al.
Published: (2024)

Caption This, Reason That: VLMs Caught in the Middle
by: Weng, Zihan, et al.
Published: (2025)

T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
by: Xia, Shao-Jun, et al.
Published: (2025)

CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration
by: Zhang, Dongxu, et al.
Published: (2026)

Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters
by: Li, Kevin Y., et al.
Published: (2024)

Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval
by: Du, Yang, et al.
Published: (2024)

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving
by: Lian, Weitong, et al.
Published: (2026)

SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis
by: Pramanik, Rishav, et al.
Published: (2026)

Evaluating Compositional Generalisation in VLMs and Diffusion Models
by: Pearson, Beth, et al.
Published: (2025)

VACoT: Rethinking Visual Data Augmentation with VLMs
by: Xu, Zhengzhuo, et al.
Published: (2025)

Listener-Rewarded Thinking in VLMs for Image Preferences
by: Gambashidze, Alexander, et al.
Published: (2025)

Cross-Modal Learning of Housing Quality in Amsterdam
by: Levering, Alex, et al.
Published: (2024)

Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
by: Kim, Jeonghyeon, et al.
Published: (2025)

Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization
by: Zhou, Hefeng, et al.
Published: (2026)