Bibliographic Details
Main Authors: Zhao, Yichen; Peng, Zelin; Tang, Fenghe; Yang, Piao; Huang, Yu; Shen, Wei
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.12843
author Zhao, Yichen
Peng, Zelin
Tang, Fenghe
Yang, Piao
Huang, Yu
Shen, Wei
contents Chest X-ray (CXR) reporting follows a region-based clinical workflow in which radiologists inspect anatomical regions and integrate localized findings into a final report. However, existing resources for CXR report generation provide these supervision signals in fragmented forms. We introduce MMRad-22K, a dataset that organizes regional textual observations, anatomical grounding coordinates, localized image evidence, and report targets into structured multimodal evidence units for CXR report generation. To motivate this formulation, we first compare different evidence formats for report generation and find that structured multimodal evidence is generally more useful than text-only or bounding-box-based evidence. We then adapt a unified LVLM backbone using MMRad-22K and show that adaptation with multimodal evidence outperforms both textual-evidence adaptation and end-to-end adaptation on language and clinically oriented metrics. Under the same evaluation protocol, the adapted model also reaches a performance level comparable to several open-source LVLM references. Together, these results support MMRad-22K as a practical structured multimodal resource for training and evaluating CXR report generation aligned with clinical reading workflows.
format Preprint
id arxiv_https___arxiv_org_abs_2602_12843
institution arXiv
publishDate 2026
record_format arxiv
title MMRad-22K: A Structured Multimodal Evidence Dataset for Chest X-ray Report Generation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.12843