| Main Authors: | Zhao, Yichen; Peng, Zelin; Tang, Fenghe; Yang, Piao; Huang, Yu; Shen, Wei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2602.12843 |
| _version_ | 1866914608205266944 |
|---|---|
| author | Zhao, Yichen; Peng, Zelin; Tang, Fenghe; Yang, Piao; Huang, Yu; Shen, Wei |
| author_facet | Zhao, Yichen; Peng, Zelin; Tang, Fenghe; Yang, Piao; Huang, Yu; Shen, Wei |
| contents | Chest X-ray (CXR) reporting follows a region-based clinical workflow in which radiologists inspect anatomical regions and integrate localized findings into a final report. However, existing resources for CXR report generation provide these supervision signals in fragmented forms. We introduce MMRad-22K, a dataset that organizes regional textual observations, anatomical grounding coordinates, localized image evidence, and report targets into structured multimodal evidence units for CXR report generation. To motivate this formulation, we first compare different evidence formats for report generation and find that structured multimodal evidence is generally more useful than text-only or bounding box-based evidence. We then adapt a unified LVLM backbone using MMRad-22K and show that adaptation with multimodal evidence outperforms both textual-evidence adaptation and end-to-end adaptation on language and clinically oriented metrics. Under the same evaluation protocol, the adapted model also reaches a performance level comparable to several open-source LVLM references. Together, these results support MMRad-22K as a practical structured multimodal resource for training and evaluating CXR report generation aligned with clinical reading workflows. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_12843 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | MMRad-22K: A Structured Multimodal Evidence Dataset for Chest X-ray Report Generation |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2602.12843 |