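Staff View: Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting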

Bibliographic Details
Main Authors:	Zhao, Kun; Dai, Siyuan; Wang, Pan; Song, Jifeng; Ji, Hui; Lin, Chenghua; Zhan, Liang; Tang, Haoteng
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.03321

_version_	1866908758729293824
author	Zhao, Kun; Dai, Siyuan; Wang, Pan; Song, Jifeng; Ji, Hui; Lin, Chenghua; Zhan, Liang; Tang, Haoteng
author_facet	Zhao, Kun; Dai, Siyuan; Wang, Pan; Song, Jifeng; Ji, Hui; Lin, Chenghua; Zhan, Liang; Tang, Haoteng
contents	Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation, yet their clinical translation is hindered by architectural heterogeneity and the prevalence of factual hallucinations. Standard supervised fine-tuning often fails to strictly align linguistic outputs with visual evidence, while existing reinforcement learning approaches struggle with either prohibitive computational costs or limited exploration. To address these challenges, we propose a comprehensive framework for self-consistent radiology report generation. First, we conduct a systematic evaluation to identify optimal vision encoder and LLM backbone configurations for medical imaging. Building on this foundation, we introduce a novel "Reason-then-Summarize" architecture optimized via Group Relative Policy Optimization (GRPO). This framework restructures generation into two distinct components: a think block for detailed findings and an answer block for structured disease labels. By utilizing a multi-dimensional composite reward function, we explicitly penalize logical discrepancies between the generated narrative and the final diagnosis. Extensive experiments on the MIMIC-CXR benchmark demonstrate that our method achieves state-of-the-art performance in clinical efficacy metrics and significantly reduces hallucinations compared to strong supervised baselines.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_03321
institution	arXiv
publishDate	2026
record_format	arxiv
title	Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2601.03321
