Bibliographic Details
Main Authors: Katsarou, Katerina, Zountsas, George, Tomotaki-Dawoud, Karam, Ehrenhoefer, Alexander, Chojecki, Paul, Przewozny, David, Sauer, Igor Maximilian, Mouakher, Amira, Bosse, Sebastian
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2604.07577
_version_ 1866914459156480000
author Katsarou, Katerina
Zountsas, George
Tomotaki-Dawoud, Karam
Ehrenhoefer, Alexander
Chojecki, Paul
Przewozny, David
Sauer, Igor Maximilian
Mouakher, Amira
Bosse, Sebastian
contents Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.
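The peak-detection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `detect_peaks`, the `threshold` and `min_distance` parameters, and the greedy suppression rule are all assumptions chosen for illustration. It only shows how a per-timestep confidence signal might be turned into discrete event indices.

```python
from typing import List, Sequence


def detect_peaks(scores: Sequence[float],
                 threshold: float = 0.5,
                 min_distance: int = 3) -> List[int]:
    """Return indices of local maxima in a confidence signal.

    Hypothetical helper (not from the paper): a peak must reach
    `threshold`, and kept peaks must be at least `min_distance`
    timesteps apart.
    """
    # Candidate peaks: above threshold, strictly higher than the left
    # neighbor and not lower than the right one (a flat plateau
    # therefore yields a single candidate at its left edge).
    candidates = [
        i for i in range(1, len(scores) - 1)
        if scores[i] >= threshold
        and scores[i] > scores[i - 1]
        and scores[i] >= scores[i + 1]
    ]

    # Greedy suppression: visit candidates from strongest to weakest
    # and drop any that lie too close to an already accepted peak.
    kept: List[int] = []
    for i in sorted(candidates, key=lambda k: scores[k], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)


if __name__ == "__main__":
    # Made-up confidence values, one per video window.
    signal = [0.1, 0.2, 0.9, 0.3, 0.1, 0.6, 0.7, 0.65, 0.2, 0.95, 0.1]
    print(detect_peaks(signal, threshold=0.5, min_distance=3))
```

Each returned index would correspond to one candidate handover event. A library routine such as `scipy.signal.find_peaks` offers similar `height` and `distance` controls and could be swapped in.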
format Preprint
id arxiv_https___arxiv_org_abs_2604_07577
institution arXiv
publishDate 2026
record_format arxiv
title Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2604.07577