Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Liao, Zehui, Hu, Shishuai, Zou, Ke, Jin, Mengyuan, Zhang, Yanning, Fu, Huazhu, Zhen, Liangli, Xia, Yong
Format: Preprint
Veröffentlicht: 2025
Schlagworte:
Online-Zugang:https://arxiv.org/abs/2503.20504
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
_version_ 1866917246247370752
author Liao, Zehui
Hu, Shishuai
Zou, Ke
Jin, Mengyuan
Zhang, Yanning
Fu, Huazhu
Zhen, Liangli
Xia, Yong
author_facet Liao, Zehui
Hu, Shishuai
Zou, Ke
Jin, Mengyuan
Zhang, Yanning
Fu, Huazhu
Zhen, Liangli
Xia, Yong
contents Vision-language models (VLMs) have great potential for medical image understanding, particularly in Visual Report Generation (VRG) and Visual Question Answering (VQA), but they may generate hallucinated responses that contradict visual evidence, limiting clinical deployment. Although uncertainty-based hallucination detection methods are intuitive and effective, they are limited in medical VLMs. Specifically, Semantic Entropy (SE), effective in text-only LLMs, becomes less reliable in medical VLMs due to their overconfidence from strong language priors. To address this challenge, we propose UniVRSE, a Unified Vision-conditioned Response Semantic Entropy framework for hallucination detection in medical VLMs. UniVRSE strengthens visual guidance during uncertainty estimation by contrasting the semantic predictive distributions derived from an original image-text pair and a visually distorted counterpart, with higher entropy indicating hallucination risk. For VQA, UniVRSE works on the image-question pair, while for VRG, it decomposes the report into claims, generates verification questions, and applies vision-conditioned entropy estimation at the claim level. To evaluate hallucination detection, we propose a unified pipeline that generates responses on medical datasets and derives hallucination labels via factual consistency assessment. However, current evaluation methods rely on subjective criteria or modality-specific rules. To improve reliability, we introduce Alignment Ratio of Atomic Facts (ALFA), a novel method that quantifies fine-grained factual consistency. ALFA-derived labels provide ground truth for robust benchmarking. Experiments on six medical VQA/VRG datasets and three VLMs show UniVRSE significantly outperforms existing methods with strong cross-modal generalization.
format Preprint
id arxiv_https___arxiv_org_abs_2503_20504
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle UniVRSE: Unified Vision-conditioned Response Semantic Entropy for Hallucination Detection in Medical Vision-Language Models
Liao, Zehui
Hu, Shishuai
Zou, Ke
Jin, Mengyuan
Zhang, Yanning
Fu, Huazhu
Zhen, Liangli
Xia, Yong
Computer Vision and Pattern Recognition
Vision-language models (VLMs) have great potential for medical image understanding, particularly in Visual Report Generation (VRG) and Visual Question Answering (VQA), but they may generate hallucinated responses that contradict visual evidence, limiting clinical deployment. Although uncertainty-based hallucination detection methods are intuitive and effective, they are limited in medical VLMs. Specifically, Semantic Entropy (SE), effective in text-only LLMs, becomes less reliable in medical VLMs due to their overconfidence from strong language priors. To address this challenge, we propose UniVRSE, a Unified Vision-conditioned Response Semantic Entropy framework for hallucination detection in medical VLMs. UniVRSE strengthens visual guidance during uncertainty estimation by contrasting the semantic predictive distributions derived from an original image-text pair and a visually distorted counterpart, with higher entropy indicating hallucination risk. For VQA, UniVRSE works on the image-question pair, while for VRG, it decomposes the report into claims, generates verification questions, and applies vision-conditioned entropy estimation at the claim level. To evaluate hallucination detection, we propose a unified pipeline that generates responses on medical datasets and derives hallucination labels via factual consistency assessment. However, current evaluation methods rely on subjective criteria or modality-specific rules. To improve reliability, we introduce Alignment Ratio of Atomic Facts (ALFA), a novel method that quantifies fine-grained factual consistency. ALFA-derived labels provide ground truth for robust benchmarking. Experiments on six medical VQA/VRG datasets and three VLMs show UniVRSE significantly outperforms existing methods with strong cross-modal generalization.
title UniVRSE: Unified Vision-conditioned Response Semantic Entropy for Hallucination Detection in Medical Vision-Language Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2503.20504