Bibliographic Details
Main Authors: Jiang, Jian, Lin, Chenxi, Gu, Yiming, Qin, Zengyi, Zeng, Zhitao, Yuan, Kun, Long, Yonghao, Xia, Xiang, Yuan, Cheng, Wang, Yuqi, Yue, Zijie, Yang, Kunyi, Zhang, Yuting, Zhuo, Zhu, Qin, Dian, Wang, Xin, Fai, NG Chi, Anthony, Brian, Xu, Daguang, Rosman, Guy, Meireles, Ozanan, Zhang, Zizhen, Padoy, Nicolas, Wang, Hesheng, Dou, Qi, Jin, Yueming, Ban, Yutong
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.12430
author Jiang, Jian
Lin, Chenxi
Gu, Yiming
Qin, Zengyi
Zeng, Zhitao
Yuan, Kun
Long, Yonghao
Xia, Xiang
Yuan, Cheng
Wang, Yuqi
Yue, Zijie
Yang, Kunyi
Zhang, Yuting
Zhuo, Zhu
Qin, Dian
Wang, Xin
Fai, NG Chi
Anthony, Brian
Xu, Daguang
Rosman, Guy
Meireles, Ozanan
Zhang, Zizhen
Padoy, Nicolas
Wang, Hesheng
Dou, Qi
Jin, Yueming
Ban, Yutong
contents Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical vision-language model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Evaluation on SurgBench, comprising six public benchmarks and six multi-center external validation datasets from five institutions, demonstrates that Surg-R1 achieves the highest Arena Score on public benchmarks (64.9%, versus 46.1% for Gemini 3.0 Pro and 37.9% for GPT-5.1). It outperforms both proprietary reasoning models and specialized surgical VLMs on the majority of tasks spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical view of safety assessment, and improves on the strongest surgical baseline by 15.2 percentage points on external validation.
format Preprint
id arxiv_https___arxiv_org_abs_2603_12430
institution arXiv
publishDate 2026
record_format arxiv
title Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.12430