Gespeichert in:
| Hauptverfasser: | , , , , , , , |
|---|---|
| Format: | Preprint |
| Veröffentlicht: |
2025
|
| Schlagworte: | |
| Online-Zugang: | https://arxiv.org/abs/2503.20504 |
| Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
| _version_ | 1866917246247370752 |
|---|---|
| author | Liao, Zehui Hu, Shishuai Zou, Ke Jin, Mengyuan Zhang, Yanning Fu, Huazhu Zhen, Liangli Xia, Yong |
| author_facet | Liao, Zehui Hu, Shishuai Zou, Ke Jin, Mengyuan Zhang, Yanning Fu, Huazhu Zhen, Liangli Xia, Yong |
| contents | Vision-language models (VLMs) have great potential for medical image understanding, particularly in Visual Report Generation (VRG) and Visual Question Answering (VQA), but they may generate hallucinated responses that contradict visual evidence, limiting clinical deployment. Although uncertainty-based hallucination detection methods are intuitive and effective, they are limited in medical VLMs. Specifically, Semantic Entropy (SE), effective in text-only LLMs, becomes less reliable in medical VLMs due to their overconfidence from strong language priors. To address this challenge, we propose UniVRSE, a Unified Vision-conditioned Response Semantic Entropy framework for hallucination detection in medical VLMs. UniVRSE strengthens visual guidance during uncertainty estimation by contrasting the semantic predictive distributions derived from an original image-text pair and a visually distorted counterpart, with higher entropy indicating hallucination risk. For VQA, UniVRSE works on the image-question pair, while for VRG, it decomposes the report into claims, generates verification questions, and applies vision-conditioned entropy estimation at the claim level. To evaluate hallucination detection, we propose a unified pipeline that generates responses on medical datasets and derives hallucination labels via factual consistency assessment. However, current evaluation methods rely on subjective criteria or modality-specific rules. To improve reliability, we introduce Alignment Ratio of Atomic Facts (ALFA), a novel method that quantifies fine-grained factual consistency. ALFA-derived labels provide ground truth for robust benchmarking. Experiments on six medical VQA/VRG datasets and three VLMs show UniVRSE significantly outperforms existing methods with strong cross-modal generalization. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2503_20504 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | UniVRSE: Unified Vision-conditioned Response Semantic Entropy for Hallucination Detection in Medical Vision-Language Models Liao, Zehui Hu, Shishuai Zou, Ke Jin, Mengyuan Zhang, Yanning Fu, Huazhu Zhen, Liangli Xia, Yong Computer Vision and Pattern Recognition Vision-language models (VLMs) have great potential for medical image understanding, particularly in Visual Report Generation (VRG) and Visual Question Answering (VQA), but they may generate hallucinated responses that contradict visual evidence, limiting clinical deployment. Although uncertainty-based hallucination detection methods are intuitive and effective, they are limited in medical VLMs. Specifically, Semantic Entropy (SE), effective in text-only LLMs, becomes less reliable in medical VLMs due to their overconfidence from strong language priors. To address this challenge, we propose UniVRSE, a Unified Vision-conditioned Response Semantic Entropy framework for hallucination detection in medical VLMs. UniVRSE strengthens visual guidance during uncertainty estimation by contrasting the semantic predictive distributions derived from an original image-text pair and a visually distorted counterpart, with higher entropy indicating hallucination risk. For VQA, UniVRSE works on the image-question pair, while for VRG, it decomposes the report into claims, generates verification questions, and applies vision-conditioned entropy estimation at the claim level. To evaluate hallucination detection, we propose a unified pipeline that generates responses on medical datasets and derives hallucination labels via factual consistency assessment. However, current evaluation methods rely on subjective criteria or modality-specific rules. To improve reliability, we introduce Alignment Ratio of Atomic Facts (ALFA), a novel method that quantifies fine-grained factual consistency. ALFA-derived labels provide ground truth for robust benchmarking. Experiments on six medical VQA/VRG datasets and three VLMs show UniVRSE significantly outperforms existing methods with strong cross-modal generalization. |
| title | UniVRSE: Unified Vision-conditioned Response Semantic Entropy for Hallucination Detection in Medical Vision-Language Models |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2503.20504 |