Bibliographic Details
Main Authors: Zhou, Rulin; Wang, Guankun; Wang, An; Ma, Yujie; Ouyang, Lixin; Cui, Bolin; Li, Junyan; Zhu, Chaowei; Li, Mingyang; Chen, Ming; Zhong, Xiaopin; Lu, Peng; Wang, Jiankun; Liu, Xianming; Ren, Hongliang
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2602.20636
Abstract:
  • Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatmap, enabling continuous and interpretable frame-wise FoV guidance. We propose SurgAtt-Tracker, a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level reranking and motion-aware refinement, rather than direct regression. To support systematic training and evaluation, we introduce SurgAtt-1.16M, a large-scale benchmark with a clinically grounded annotation protocol that enables comprehensive heatmap-based attention analysis across procedures and institutions. Extensive experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker consistently achieves state-of-the-art performance and strong robustness under occlusion, multi-instrument interference, and cross-domain settings. Beyond attention tracking, our approach provides a frame-wise FoV guidance signal that can directly support downstream robotic FoV planning and automatic camera control.