| Main Authors: | Chen, Minbing, Meng, Zhu, Su, Fei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.16113 |
Similar Items
Pathological Truth Bias in Vision-Language Models
by: Thube, Yash
Published: (2025)
TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention
by: Duan, Jinhao, et al.
Published: (2025)
HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction
by: Yuan, Ruicheng, et al.
Published: (2026)
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
by: Dong, Xinpeng, et al.
Published: (2026)
Physically Grounded Vision-Language Models for Robotic Manipulation
by: Gao, Jensen, et al.
Published: (2023)
Self-Supervised Multi-Object Tracking with Path Consistency
by: Lu, Zijia, et al.
Published: (2024)
GLS: Geometry-aware 3D Language Gaussian Splatting
by: Qiu, Jiaxiong, et al.
Published: (2024)
GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
by: Chen, Boyuan, et al.
Published: (2026)
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
by: Zhao, Hongbo, et al.
Published: (2025)
Towards Self-Refinement of Vision-Language Models with Triangular Consistency
by: Deng, Yunlong, et al.
Published: (2025)
Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
by: Xue, Haotian, et al.
Published: (2025)
IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth
by: Islam, Md Touhidul, et al.
Published: (2025)
MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering
by: Dang, Jisheng, et al.
Published: (2025)
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
by: Wan, David, et al.
Published: (2024)
Cost-effective Instruction Learning for Pathology Vision and Language Analysis
by: Chen, Kaitao, et al.
Published: (2024)
Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model
by: Guan, Hao, et al.
Published: (2026)
Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Pathology Analysis
by: Zhang, Shengxuming, et al.
Published: (2024)
Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding
by: Back, Kyungryul, et al.
Published: (2025)
First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models
by: Zhang, Enming, et al.
Published: (2024)
Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning
by: Qu, Xiaoye, et al.
Published: (2024)
Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework
by: Han, Xiao, et al.
Published: (2024)
TruthLens: Visual Grounding for Universal DeepFake Reasoning
by: Kundu, Rohit, et al.
Published: (2025)
HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing
by: Zhu, Zihao, et al.
Published: (2024)
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
by: Zhou, Yuhao, et al.
Published: (2026)
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
by: Dai, Ming, et al.
Published: (2025)
Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
by: Liu, Junming, et al.
Published: (2026)
MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models
by: Yan, Bei, et al.
Published: (2024)
PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis
by: Hua, Shengyi, et al.
Published: (2025)
Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation
by: Yang, Zhiyuan, et al.
Published: (2026)
Echo-Path: Pathology-Conditioned Echo Video Generation
by: Muhammad, Kabir Hamzah, et al.
Published: (2025)
Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models
by: Zeng, Fanhu, et al.
Published: (2025)
Harnessing Large Vision and Language Models in Agriculture: A Review
by: Zhu, Hongyan, et al.
Published: (2024)
Practical Continual Forgetting for Pre-trained Vision Models
by: Zhao, Hongbo, et al.
Published: (2025)
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
by: Kumbhar, Shrinidhi, et al.
Published: (2026)
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI
by: Ernhofer, Benjamin Raphael, et al.
Published: (2025)
To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models
by: Aranya, OFM Riaz Rahman, et al.
Published: (2026)
PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation
by: Ahmed, Faruk, et al.
Published: (2025)
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
by: Yu, Keunwoo Peter, et al.
Published: (2025)
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
by: Cheng, Zihui, et al.
Published: (2024)
TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models
by: Shao, Wenqi, et al.
Published: (2023)