Staff View: Calibrated Confidence Expression for Radiology Report Generation :: Library Catalog

Bibliographic Details
Main Authors:	Bani-Harouni, David; Pellegrini, Chantal; Lüers, Julian; Kim, Su Hwan; Baalmann, Markus; Wiestler, Benedikt; Braren, Rickmer; Navab, Nassir; Keicher, Matthias
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.29492

_version_	1866908924841558016
author	Bani-Harouni, David; Pellegrini, Chantal; Lüers, Julian; Kim, Su Hwan; Baalmann, Markus; Wiestler, Benedikt; Braren, Rickmer; Navab, Nassir; Keicher, Matthias
author_facet	Bani-Harouni, David; Pellegrini, Chantal; Lüers, Julian; Kim, Su Hwan; Baalmann, Markus; Wiestler, Benedikt; Braren, Rickmer; Navab, Nassir; Keicher, Matthias
contents	Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant that assigns a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation, we show that ConRad's report-level scores are well aligned with clinicians' judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI assistance for report generation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_29492
institution	arXiv
publishDate	2026
record_format	arxiv
title	Calibrated Confidence Expression for Radiology Report Generation
topic	Computation and Language
url	https://arxiv.org/abs/2603.29492

Similar Items