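Staff View: Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization :: Library Catalog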

Bibliographic Details
Main Authors:	Naznin, Mst. Fahmida Sultana; Faruq, Adnan Ibney; Rahman, Mushfiqur; Mondal, Niloy Kumar; Shawon, Md. Mehedi Hasan; Hasan, Md Rakibul
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2603.29901

_version_	1866914435450273792
author	Naznin, Mst. Fahmida Sultana; Faruq, Adnan Ibney; Rahman, Mushfiqur; Mondal, Niloy Kumar; Shawon, Md. Mehedi Hasan; Hasan, Md Rakibul
author_facet	Naznin, Mst. Fahmida Sultana; Faruq, Adnan Ibney; Rahman, Mushfiqur; Mondal, Niloy Kumar; Shawon, Md. Mehedi Hasan; Hasan, Md Rakibul
contents	Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS $\to$ IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS (Visual-Text Attention Summarizer), a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves state-of-the-art results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_29901
institution	arXiv
publishDate	2026
record_format	arxiv
title	Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization
topic	Computer Vision and Pattern Recognition; Computation and Language
url	https://arxiv.org/abs/2603.29901
