Staff View: CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering :: Library Catalog

Bibliographic Details
Main Authors:	Mao, Yuren; Xu, Wenyi; Qin, Yuyang; Gao, Yunjun
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.16229

_version_	1866916751321595904
author	Mao, Yuren; Xu, Wenyi; Qin, Yuyang; Gao, Yunjun
author_facet	Mao, Yuren; Xu, Wenyi; Qin, Yuyang; Gao, Yunjun
contents	A computed tomography (CT) scan, which produces 3D volumetric medical data that can be viewed as hundreds of cross-sectional images (a.k.a. slices), provides detailed anatomical information for diagnosis. For radiologists, creating CT radiology reports is time-consuming and error-prone. A visual question answering (VQA) system that can answer radiologists' questions about specific anatomical regions on the CT scan, and even automatically generate a radiology report, is urgently needed. However, existing VQA systems cannot adequately handle the CT radiology question answering (CTQA) task for two reasons: (1) anatomic complexity makes CT images difficult to understand; (2) the spatial relationship across hundreds of slices is difficult to capture. To address these issues, this paper proposes CT-Agent, a multimodal agentic framework for CTQA. CT-Agent adopts anatomically independent tools to break down the anatomic complexity; furthermore, it efficiently captures the across-slice spatial relationship with a global-local token compression strategy. Experimental results on two 3D chest CT datasets, CT-RATE and RadGenome-ChestCT, verify the superior performance of CT-Agent.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_16229
institution	arXiv
publishDate	2025
record_format	arxiv
title	CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2505.16229

Similar Items