MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Katsarou, Katerina; Zountsas, George; Tomotaki-Dawoud, Karam; Ehrenhoefer, Alexander; Chojecki, Paul; Przewozny, David; Sauer, Igor Maximilian; Mouakher, Amira; Bosse, Sebastian
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.07577

_version_	1866914459156480000
author	Katsarou, Katerina; Zountsas, George; Tomotaki-Dawoud, Karam; Ehrenhoefer, Alexander; Chojecki, Paul; Przewozny, David; Sauer, Igor Maximilian; Mouakher, Amira; Bosse, Sebastian
author_facet	Katsarou, Katerina; Zountsas, George; Tomotaki-Dawoud, Karam; Ehrenhoefer, Alexander; Chojecki, Paul; Przewozny, David; Sauer, Igor Maximilian; Mouakher, Amira; Bosse, Sebastian
contents	Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_07577
institution	arXiv
publishDate	2026
record_format	arxiv
title	Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.07577
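
Note on the abstract: the contents field states that per-frame confidence scores form a temporal signal from which discrete handover events are identified via peak detection. The record does not say how the authors implement that step, so the following is only a minimal sketch of one common approach (thresholded local maxima with a minimum spacing). The function name `detect_events` and the parameters `threshold` and `min_gap` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def detect_events(
    scores: Sequence[float],
    threshold: float = 0.5,   # hypothetical minimum confidence for a peak
    min_gap: int = 5,         # hypothetical minimum spacing between events, in frames
) -> List[int]:
    """Return frame indices of score peaks, treated here as candidate events.

    A frame is a candidate if its score is at least `threshold`, strictly
    greater than the previous frame, and not smaller than the next frame.
    Candidates are then kept greedily from strongest to weakest, dropping
    any that fall within `min_gap` frames of an already kept peak.
    """
    last = len(scores) - 1
    candidates = [
        i
        for i in range(len(scores))
        if scores[i] >= threshold
        and (i == 0 or scores[i] > scores[i - 1])
        and (i == last or scores[i] >= scores[i + 1])
    ]

    kept: List[int] = []
    for i in sorted(candidates, key=lambda k: scores[k], reverse=True):
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(i)
    return sorted(kept)


# Illustrative, made-up confidence signal (not data from the paper).
signal = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.6, 0.8, 0.7, 0.2]
events = detect_events(signal, threshold=0.5, min_gap=3)
```

With the made-up signal above, the two local maxima at frames 2 and 7 are far enough apart to be reported as separate events; raising `min_gap` past their distance would keep only the stronger one.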
