Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Zhu, Yan, Luo, Te, Fu, Pei-Yao, Zhang, Zhen, Wang, Zi-Long, Qu, Yi-Fan, Geng, Zi-Han, Xu, Jia-Qi, Yao, Lu, Ma, Li-Yun, Su, Wei, Chen, Wei-Feng, Li, Quan-Lin, Wang, Shuo, Zhou, Ping-Hong
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.08183
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866909989771149312
author	Zhu, Yan Luo, Te Fu, Pei-Yao Zhang, Zhen Wang, Zi-Long Qu, Yi-Fan Geng, Zi-Han Xu, Jia-Qi Yao, Lu Ma, Li-Yun Su, Wei Chen, Wei-Feng Li, Quan-Lin Wang, Shuo Zhou, Ping-Hong
author_facet	Zhu, Yan Luo, Te Fu, Pei-Yao Zhang, Zhen Wang, Zi-Long Qu, Yi-Fan Geng, Zi-Han Xu, Jia-Qi Yao, Lu Ma, Li-Yun Su, Wei Chen, Wei-Feng Li, Quan-Lin Wang, Shuo Zhou, Ping-Hong
contents	Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. To systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted; human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_08183
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards Zhu, Yan Luo, Te Fu, Pei-Yao Zhang, Zhen Wang, Zi-Long Qu, Yi-Fan Geng, Zi-Han Xu, Jia-Qi Yao, Lu Ma, Li-Yun Su, Wei Chen, Wei-Feng Li, Quan-Lin Wang, Shuo Zhou, Ping-Hong Computer Vision and Pattern Recognition Artificial Intelligence Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. To systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted; human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
title	GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards
topic	Computer Vision and Pattern Recognition Artificial Intelligence
url	https://arxiv.org/abs/2601.08183

Similar Items