Bibliographic Details
Main Authors: Zhu, Yan; Luo, Te; Fu, Pei-Yao; Zhang, Zhen; Wang, Zi-Long; Qu, Yi-Fan; Geng, Zi-Han; Xu, Jia-Qi; Yao, Lu; Ma, Li-Yun; Su, Wei; Chen, Wei-Feng; Li, Quan-Lin; Wang, Shuo; Zhou, Ping-Hong
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2601.08183
Summary:
  • Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance across comprehensive clinical workflows and against human benchmarks remains unverified. This study systematically evaluates state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and assesses their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and a multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted: human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
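For readers unfamiliar with the two quantitative metrics named in the summary, the following is a minimal sketch of how Macro-F1 and mIoU are commonly computed. It is not the GI-Bench evaluation code. The function names (`macro_f1`, `box_iou`, `mean_iou`) and the axis-aligned bounding-box format are assumptions made for illustration; the record does not state how lesion regions are represented.

```python
from typing import Sequence, Tuple

# Assumed box format (hypothetical): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(truth: Sequence[Box], pred: Sequence[Box]) -> float:
    """Average IoU over paired ground-truth and predicted boxes."""
    if not truth:
        return 0.0
    return sum(box_iou(t, p) for t, p in zip(truth, pred)) / len(truth)
```

Macro-F1 weights every lesion category equally regardless of how many images it contains, which is why it is a common choice for fine-grained, imbalanced label sets; mIoU rewards overlap between the predicted and annotated region rather than a correct label alone.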