Library Catalog

Bibliographic Details
Main Authors:	He, Haoyu; Zhuo, Yue; Zheng, Yu; Wang, Qi R.
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.27070

Similar Items

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation
by: Wang, Haoyu, et al.
Published: (2026)

Probing and Inducing Combinational Creativity in Vision-Language Models
by: Peng, Yongqian, et al.
Published: (2025)

Probing Perceptual Constancy in Large Vision-Language Models
by: Sun, Haoran, et al.
Published: (2025)

UMIT: Unifying Medical Imaging Tasks via Vision-Language Models
by: Yu, Haiyang, et al.
Published: (2025)

Vision Language Models for Spreadsheet Understanding: Challenges and Opportunities
by: Xia, Shiyu, et al.
Published: (2024)

TraversalBench: Challenging Paths to Follow for Vision Language Models
by: Petrova, Clara, et al.
Published: (2026)

Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
by: Qian, Zhaofang, et al.
Published: (2025)

From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models
by: Han, Tiancheng, et al.
Published: (2025)

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
by: Li, Rongjie, et al.
Published: (2024)

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models
by: Min, Cheolhong, et al.
Published: (2026)

DexVLG: Dexterous Vision-Language-Grasp Model at Scale
by: He, Jiawei, et al.
Published: (2025)

HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding
by: Xia, Peng, et al.
Published: (2023)

Joint Vision-Language Social Bias Removal for CLIP
by: Zhang, Haoyu, et al.
Published: (2024)

GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation
by: Yin, Hang, et al.
Published: (2025)

RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
by: He, Xuming, et al.
Published: (2025)

Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach
by: Guan, Jiwei, et al.
Published: (2024)

Vision Large Language Models Are Good Noise Handlers in Engagement Analysis
by: Vedernikov, Alexander, et al.
Published: (2025)

Conceptual Codebook Learning for Vision-Language Models
by: Zhang, Yi, et al.
Published: (2024)

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage
by: Wang, Yu, et al.
Published: (2024)

Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models
by: Yang, Fan, et al.
Published: (2024)

Language-Guided Invariance Probing of Vision-Language Models
by: Lee, Jae Joong
Published: (2025)

Co-Training Vision Language Models for Remote Sensing Multi-task Learning
by: Li, Qingyun, et al.
Published: (2025)

VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck
by: Zhang, Feiran, et al.
Published: (2026)

Integrated Structural Prompt Learning for Vision-Language Models
by: Wang, Jiahui, et al.
Published: (2025)

Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning
by: Huang, Zhuo, et al.
Published: (2023)

Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure
by: Li, Zitong, et al.
Published: (2026)

ESceme: Vision-and-Language Navigation with Episodic Scene Memory
by: Zheng, Qi, et al.
Published: (2023)

Hierarchical Cross-modal Prompt Learning for Vision-Language Models
by: Zheng, Hao, et al.
Published: (2025)

Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
by: Wang, Chenfeng, et al.
Published: (2026)

Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving
by: Zhao, Zongchuang, et al.
Published: (2025)

Safety Alignment for Vision Language Models
by: Liu, Zhendong, et al.
Published: (2024)

Enhancing Vision-Language Model with Unmasked Token Alignment
by: Liu, Jihao, et al.
Published: (2024)

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification
by: Peng, Wenshuo, et al.
Published: (2024)

DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding
by: Xuan, Weihao, et al.
Published: (2025)

Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models
by: Zheng, Yue, et al.
Published: (2025)

Can Large Vision-Language Models Understand Multimodal Sarcasm?
by: Wang, Xinyu, et al.
Published: (2025)

Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
by: Lu, Fan, et al.
Published: (2024)

Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles
by: Lymperaiou, Maria, et al.
Published: (2026)

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)

UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
by: Jiang, Haoyu, et al.
Published: (2024)