Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Pezold, Simon, Kurylec, Jérôme A., Liechti, Jan S., Müller, Beat P., Lavanchy, Joël L.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.06831
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866916939967758336
author	Pezold, Simon Kurylec, Jérôme A. Liechti, Jan S. Müller, Beat P. Lavanchy, Joël L.
author_facet	Pezold, Simon Kurylec, Jérôme A. Liechti, Jan S. Müller, Beat P. Lavanchy, Joël L.
contents	We investigate how both the adaptation of a generic foundation model via transfer learning and the integration of complementary modalities from the operating room (OR) can support surgical data science. To this end, we use V-JEPA as the single-modality foundation of a multimodal model for minimally invasive surgery support. We analyze how the model's downstream performance can benefit (a) from finetuning on unlabeled surgical video data and (b) from providing additional time-resolved data streams from the OR in a multimodal setup. In an in-house dataset of liver surgery videos, we analyze the tasks of predicting hospital length of stay and postoperative complications. In videos of the public HeiCo dataset, we analyze the task of surgical phase recognition. As a baseline, we apply pretrained V-JEPA to all tasks. We then finetune it on unlabeled, held-out videos to investigate its change in performance after domain adaptation. Following the idea of modular decision support networks, we integrate additional data streams from the OR by training a separate encoder to form a shared representation space with V-JEPA's embeddings. Our experiments show that finetuning on domain-specific data increases model performance. On the in-house data, integrating additional time-resolved data likewise benefits the model. On the HeiCo data, accuracy of the pretrained video-only, single-modality baseline setup is on par with the top-performing submissions of the EndoVis2017 challenge, while finetuning on domain-specific data increases accuracy further. Our results thus demonstrate how surgical data science can leverage public, generic foundation models. Likewise, they indicate the potential of domain adaptation and of integrating suitable complementary data streams from the OR. To support further research, we release our code and model weights at https://github.com/DigitalSurgeryLab-Basel/ML-CDS-2025.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_06831
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis Pezold, Simon Kurylec, Jérôme A. Liechti, Jan S. Müller, Beat P. Lavanchy, Joël L. Computer Vision and Pattern Recognition We investigate how both the adaptation of a generic foundation model via transfer learning and the integration of complementary modalities from the operating room (OR) can support surgical data science. To this end, we use V-JEPA as the single-modality foundation of a multimodal model for minimally invasive surgery support. We analyze how the model's downstream performance can benefit (a) from finetuning on unlabeled surgical video data and (b) from providing additional time-resolved data streams from the OR in a multimodal setup. In an in-house dataset of liver surgery videos, we analyze the tasks of predicting hospital length of stay and postoperative complications. In videos of the public HeiCo dataset, we analyze the task of surgical phase recognition. As a baseline, we apply pretrained V-JEPA to all tasks. We then finetune it on unlabeled, held-out videos to investigate its change in performance after domain adaptation. Following the idea of modular decision support networks, we integrate additional data streams from the OR by training a separate encoder to form a shared representation space with V-JEPA's embeddings. Our experiments show that finetuning on domain-specific data increases model performance. On the in-house data, integrating additional time-resolved data likewise benefits the model. On the HeiCo data, accuracy of the pretrained video-only, single-modality baseline setup is on par with the top-performing submissions of the EndoVis2017 challenge, while finetuning on domain-specific data increases accuracy further. Our results thus demonstrate how surgical data science can leverage public, generic foundation models. Likewise, they indicate the potential of domain adaptation and of integrating suitable complementary data streams from the OR. To support further research, we release our code and model weights at https://github.com/DigitalSurgeryLab-Basel/ML-CDS-2025.
title	Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2509.06831

Similar Items