Staff View: NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification

Bibliographic Details
Main Authors:	Song, Ziyang; Zang, Zelin; Ye, Xiaofan; Xu, Boqiang; Bai, Long; Wu, Jinlin; Ren, Hongliang; Liu, Hongbin; Luo, Jiebo; Lei, Zhen
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2512.06921

_version_	1866915659567333376
author	Song, Ziyang; Zang, Zelin; Ye, Xiaofan; Xu, Boqiang; Bai, Long; Wu, Jinlin; Ren, Hongliang; Liu, Hongbin; Luo, Jiebo; Lei, Zhen
author_facet	Song, Ziyang; Zang, Zelin; Ye, Xiaofan; Xu, Boqiang; Bai, Long; Wu, Jinlin; Ren, Hongliang; Liu, Hongbin; Luo, Jiebo; Lei, Zhen
contents	Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets focus primarily on understanding surgical procedures and workflows and pay limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, the lowest-scoring student achieves 28%, and the average score is 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group's average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_06921
institution	arXiv
publishDate	2025
record_format	arxiv
title	NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2512.06921
