Bibliographic Details
Main Authors: Zhu, Yan; Luo, Te; Fu, Pei-Yao; Zhang, Zhen; Wang, Zi-Long; Qu, Yi-Fan; Geng, Zi-Han; Xu, Jia-Qi; Yao, Lu; Ma, Li-Yun; Su, Wei; Chen, Wei-Feng; Li, Quan-Lin; Wang, Shuo; Zhou, Ping-Hong
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.08183
_version_ 1866909989771149312
author Zhu, Yan
Luo, Te
Fu, Pei-Yao
Zhang, Zhen
Wang, Zi-Long
Qu, Yi-Fan
Geng, Zi-Han
Xu, Jia-Qi
Yao, Lu
Ma, Li-Yun
Su, Wei
Chen, Wei-Feng
Li, Quan-Lin
Wang, Shuo
Zhou, Ping-Hong
contents Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. This study aimed to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and to determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and a multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted: human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
format Preprint
id arxiv_https___arxiv_org_abs_2601_08183
institution arXiv
publishDate 2026
record_format arxiv
title GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2601.08183
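The abstract reports results as Macro-F1 and mean Intersection-over-Union (mIoU). For readers unfamiliar with these metrics, the following is a minimal sketch of how they are commonly computed. It is not the authors' evaluation code: the function names, the label strings, and the assumption that lesion locations are axis-aligned boxes given as (x1, y1, x2, y2) are all hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(truth: Sequence[Box], pred: Sequence[Box]) -> float:
    """Average IoU over paired ground-truth and predicted boxes."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    ious = [box_iou(t, p) for t, p in zip(truth, pred)]
    return sum(ious) / len(ious) if ious else 0.0
```

Because Macro-F1 averages over classes rather than over samples, a rare lesion category counts as much as a common one, which is why it suits a benchmark with many fine-grained categories of uneven frequency.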