Bibliographic Details
Main Authors: Zhao, Kun, Dai, Siyuan, Wang, Pan, Song, Jifeng, Ji, Hui, Lin, Chenghua, Zhan, Liang, Tang, Haoteng
Format: Preprint
Published: 2026
Subjects: Machine Learning, Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.03321
_version_ 1866908758729293824
author Zhao, Kun
Dai, Siyuan
Wang, Pan
Song, Jifeng
Ji, Hui
Lin, Chenghua
Zhan, Liang
Tang, Haoteng
contents Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation, yet their clinical translation is hindered by architectural heterogeneity and the prevalence of factual hallucinations. Standard supervised fine-tuning often fails to strictly align linguistic outputs with visual evidence, while existing reinforcement learning approaches struggle with either prohibitive computational costs or limited exploration. To address these challenges, we propose a comprehensive framework for self-consistent radiology report generation. First, we conduct a systematic evaluation to identify optimal vision encoder and LLM backbone configurations for medical imaging. Building on this foundation, we introduce a novel "Reason-then-Summarize" architecture optimized via Group Relative Policy Optimization (GRPO). This framework restructures generation into two distinct components: a think block for detailed findings and an answer block for structured disease labels. By utilizing a multi-dimensional composite reward function, we explicitly penalize logical discrepancies between the generated narrative and the final diagnosis. Extensive experiments on the MIMIC-CXR benchmark demonstrate that our method achieves state-of-the-art performance in clinical efficacy metrics and significantly reduces hallucinations compared to strong supervised baselines.
format Preprint
id arxiv_https___arxiv_org_abs_2601_03321
institution arXiv
publishDate 2026
record_format arxiv
title Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2601.03321
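
Illustrative sketch (not part of the catalog record). The abstract describes a think block for findings, an answer block for disease labels, a composite reward that penalizes disagreement between the two, and optimization with GRPO. The Python below is one minimal way such a reward and the group-relative advantage could look. Everything specific in it is an assumption made for illustration: the `<think>`/`<answer>` tag syntax, the three-label vocabulary, the keyword matcher with naive negation handling, the Jaccard scoring, and the reward weights are hypothetical and are not taken from the paper. Only the advantage normalization (reward minus group mean, divided by group standard deviation) follows the commonly published GRPO form.

```python
import re
from statistics import mean, pstdev

# Hypothetical label vocabulary; the abstract does not list the real one.
LABELS = ("cardiomegaly", "edema", "pneumonia")

# Assumed output format: <think>findings</think><answer>label, label</answer>
BLOCKS = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def parse(completion):
    """Split a completion into (findings, labels); None if malformed."""
    m = BLOCKS.fullmatch(completion.strip())
    if m is None:
        return None
    findings = m.group(1).lower()
    labels = {s.strip().lower() for s in m.group(2).split(",") if s.strip()}
    return findings, labels


def labels_in_findings(findings):
    """Toy keyword matcher; skips labels preceded by 'no' or 'without'."""
    found = set()
    for label in LABELS:
        negated = re.search(rf"\b(no|without)\s+{label}", findings)
        if label in findings and not negated:
            found.add(label)
    return found


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as full agreement."""
    return 1.0 if not a and not b else len(a & b) / len(a | b)


def composite_reward(completion, reference_labels,
                     w_format=0.2, w_consistency=0.4, w_accuracy=0.4):
    """Weighted sum of format, self-consistency, and label accuracy."""
    parsed = parse(completion)
    if parsed is None:
        return 0.0  # malformed output earns nothing
    findings, answer = parsed
    consistency = jaccard(labels_in_findings(findings), answer)
    accuracy = jaccard(answer, set(reference_labels))
    return w_format + w_consistency * consistency + w_accuracy * accuracy


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "<think>Enlarged heart, consistent with cardiomegaly. No edema."
        "</think><answer>cardiomegaly</answer>",
        "<think>No cardiomegaly. Lungs are clear.</think>"
        "<answer>cardiomegaly</answer>",
        "cardiomegaly",
    ]
    rewards = [composite_reward(c, {"cardiomegaly"}) for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch the second completion names the correct label but contradicts its own findings, so it scores below the first; the third lacks the two-block format and scores zero. A real system would replace the keyword matcher with a clinical label extractor.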