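Staff View: Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs :: Library Catalog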

Bibliographic Details
Main Authors:	Nguyen, Dung; Ho, Minh Khoi; Ta, Huy; Nguyen, Thanh Tam; Chen, Qi; Rav, Kumar; Dang, Quy Duong; Ramchandre, Satwik; Phung, Son Lam; Liao, Zhibin; To, Minh-Son; Verjans, Johan; Nguyen, Phi Le; Phan, Vu Minh Hieu
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.00744

_version_	1866917339539177472
author	Nguyen, Dung; Ho, Minh Khoi; Ta, Huy; Nguyen, Thanh Tam; Chen, Qi; Rav, Kumar; Dang, Quy Duong; Ramchandre, Satwik; Phung, Son Lam; Liao, Zhibin; To, Minh-Son; Verjans, Johan; Nguyen, Phi Le; Phan, Vu Minh Hieu
author_facet	Nguyen, Dung; Ho, Minh Khoi; Ta, Huy; Nguyen, Thanh Tam; Chen, Qi; Rav, Kumar; Dang, Quy Duong; Ramchandre, Satwik; Phung, Son Lam; Liao, Zhibin; To, Minh-Son; Verjans, Johan; Nguyen, Phi Le; Phan, Vu Minh Hieu
contents	Medical Large Multi-modal Models (LMMs) have demonstrated remarkable capabilities in medical data interpretation. However, these models frequently generate hallucinations contradicting source evidence, particularly due to inadequate localization reasoning. This work reveals a critical limitation in current medical LMMs: instead of analyzing relevant pathological regions, they often rely on linguistic patterns or attend to irrelevant image areas when responding to disease-related queries. To address this, we introduce HEAL-MedVQA (Hallucination Evaluation via Localization MedVQA), a comprehensive benchmark designed to evaluate LMMs' localization abilities and hallucination robustness. HEAL-MedVQA features (i) two innovative evaluation protocols to assess visual and textual shortcut learning, and (ii) a dataset of 67K VQA pairs, with doctor-annotated anatomical segmentation masks for pathological regions. To improve visual reasoning, we propose the Localize-before-Answer (LobA) framework, which trains LMMs to localize target regions of interest and self-prompt to emphasize segmented pathological areas, generating grounded and reliable answers. Experimental results demonstrate that our approach significantly outperforms state-of-the-art biomedical LMMs on the challenging HEAL-MedVQA benchmark, advancing robustness in medical VQA.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_00744
institution	arXiv
publishDate	2025
record_format	arxiv
title	Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2505.00744

Similar Items