| Main Authors: | Basappa, Aahana; Goel, Pranay; Karra, Anusri; Karra, Anish; Gilmore, Asa; Zhu, Kevin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.17037 |
Similar Items
AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors
by: Zhang, Junyang, et al.
Published: (2025)
Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption
by: Erol, Mehmet Kaan
Published: (2026)
Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables
by: Singh, Anshul, et al.
Published: (2025)
VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
by: Hong, Hyesoo, et al.
Published: (2026)
PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
by: Mak, Chak-Wing, et al.
Published: (2026)
The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
by: Azad, Asif, et al.
Published: (2025)
Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting
by: Guo, Xuyang, et al.
Published: (2025)
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
by: Tan, Zhangyun, et al.
Published: (2026)
CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
by: Jian, Ai, et al.
Published: (2025)
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
by: Wang, Xingrui, et al.
Published: (2025)
CLASH: A Benchmark for Cross-Modal Contradiction Detection
by: Popordanoska, Teodora, et al.
Published: (2025)
CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems
by: Tian, Yonglin, et al.
Published: (2026)
Discovering Failure Modes in Vision-Language Models using RL
by: Jain, Kanishk, et al.
Published: (2026)
Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving
by: Tang, Zecong, et al.
Published: (2026)
Multi-Prompt with Depth Partitioned Cross-Modal Learning
by: Tian, Yingjie, et al.
Published: (2023)
Can Generalist Vision Language Models (VLMs) Rival Specialist Medical VLMs? Benchmarking and Strategic Insights
by: Zhong, Yuan, et al.
Published: (2025)
iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs
by: Mayer, Julius, et al.
Published: (2025)
IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
by: Faraz, Ali, et al.
Published: (2025)
VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs
by: Berman, Shmuel, et al.
Published: (2025)
Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs
by: Ballout, Mohamad, et al.
Published: (2025)
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
by: Oh, Youngtaek, et al.
Published: (2024)
CrossModalityDiffusion: Multi-Modal Novel View Synthesis with Unified Intermediate Representation
by: Berian, Alex, et al.
Published: (2025)
Enhancing Multimodal Unified Representations for Cross Modal Generalization
by: Huang, Hai, et al.
Published: (2024)
BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs
by: Singha, Mainak, et al.
Published: (2026)
A Study of Failure Modes in Two-Stage Human-Object Interaction Detection
by: Wang, Lemeng, et al.
Published: (2026)
FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection
by: Bhaskar, Paramananda, et al.
Published: (2026)
Source-Free Cross-Modal Knowledge Transfer by Unleashing the Potential of Task-Irrelevant Data
by: Zhu, Jinjing, et al.
Published: (2024)
Caption This, Reason That: VLMs Caught in the Middle
by: Weng, Zihan, et al.
Published: (2025)
T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
by: Xia, Shao-Jun, et al.
Published: (2025)
CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration
by: Zhang, Dongxu, et al.
Published: (2026)
Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters
by: Li, Kevin Y., et al.
Published: (2024)
Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval
by: Du, Yang, et al.
Published: (2024)
Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving
by: Lian, Weitong, et al.
Published: (2026)
SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis
by: Pramanik, Rishav, et al.
Published: (2026)
Evaluating Compositional Generalisation in VLMs and Diffusion Models
by: Pearson, Beth, et al.
Published: (2025)
VACoT: Rethinking Visual Data Augmentation with VLMs
by: Xu, Zhengzhuo, et al.
Published: (2025)
Listener-Rewarded Thinking in VLMs for Image Preferences
by: Gambashidze, Alexander, et al.
Published: (2025)
Cross-Modal Learning of Housing Quality in Amsterdam
by: Levering, Alex, et al.
Published: (2024)
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
by: Kim, Jeonghyeon, et al.
Published: (2025)
Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization
by: Zhou, Hefeng, et al.
Published: (2026)