:: Library Catalog

Bibliographic Details
Main Authors:	Chen, Minbing; Meng, Zhu; Su, Fei
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.16113

Similar Items

Pathological Truth Bias in Vision-Language Models
by: Thube, Yash
Published: (2025)

TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention
by: Duan, Jinhao, et al.
Published: (2025)

HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction
by: Yuan, Ruicheng, et al.
Published: (2026)

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
by: Dong, Xinpeng, et al.
Published: (2026)

Physically Grounded Vision-Language Models for Robotic Manipulation
by: Gao, Jensen, et al.
Published: (2023)

Self-Supervised Multi-Object Tracking with Path Consistency
by: Lu, Zijia, et al.
Published: (2024)

GLS: Geometry-aware 3D Language Gaussian Splatting
by: Qiu, Jiaxiong, et al.
Published: (2024)

GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
by: Chen, Boyuan, et al.
Published: (2026)

VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
by: Zhao, Hongbo, et al.
Published: (2025)

Towards Self-Refinement of Vision-Language Models with Triangular Consistency
by: Deng, Yunlong, et al.
Published: (2025)

Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
by: Xue, Haotian, et al.
Published: (2025)

IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth
by: Islam, Md Touhidul, et al.
Published: (2025)

MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering
by: Dang, Jisheng, et al.
Published: (2025)

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
by: Wan, David, et al.
Published: (2024)

Cost-effective Instruction Learning for Pathology Vision and Language Analysis
by: Chen, Kaitao, et al.
Published: (2024)

Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model
by: Guan, Hao, et al.
Published: (2026)

Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Pathology Analysis
by: Zhang, Shengxuming, et al.
Published: (2024)

Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding
by: Back, Kyungryul, et al.
Published: (2025)

First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models
by: Zhang, Enming, et al.
Published: (2024)

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning
by: Qu, Xiaoye, et al.
Published: (2024)

Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework
by: Han, Xiao, et al.
Published: (2024)

TruthLens: Visual Grounding for Universal DeepFake Reasoning
by: Kundu, Rohit, et al.
Published: (2025)

HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing
by: Zhu, Zihao, et al.
Published: (2024)

ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
by: Zhou, Yuhao, et al.
Published: (2026)

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
by: Dai, Ming, et al.
Published: (2025)

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
by: Liu, Junming, et al.
Published: (2026)

MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models
by: Yan, Bei, et al.
Published: (2024)

PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis
by: Hua, Shengyi, et al.
Published: (2025)

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation
by: Yang, Zhiyuan, et al.
Published: (2026)

Echo-Path: Pathology-Conditioned Echo Video Generation
by: Muhammad, Kabir Hamzah, et al.
Published: (2025)

Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models
by: Zeng, Fanhu, et al.
Published: (2025)

Harnessing Large Vision and Language Models in Agriculture: A Review
by: Zhu, Hongyan, et al.
Published: (2024)

Practical Continual Forgetting for Pre-trained Vision Models
by: Zhao, Hongbo, et al.
Published: (2025)

Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
by: Kumbhar, Shrinidhi, et al.
Published: (2026)

Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI
by: Ernhofer, Benjamin Raphael, et al.
Published: (2025)

To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models
by: Aranya, OFM Riaz Rahman, et al.
Published: (2026)

PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation
by: Ahmed, Faruk, et al.
Published: (2025)

Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
by: Yu, Keunwoo Peter, et al.
Published: (2025)

CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
by: Cheng, Zihui, et al.
Published: (2024)

TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models
by: Shao, Wenqi, et al.
Published: (2023)