Staff View: ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model :: Library Catalog

Bibliographic Details
Main Author:	Kondo, Satoshi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Image and Video Processing
Online Access:	https://arxiv.org/abs/2505.13746

_version_	1866918025640280064
author	Kondo, Satoshi
author_facet	Kondo, Satoshi
contents	Surgical phase recognition from video is a technology that automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition technology have focused primarily on Transformer-based methods, although methods that extract spatial features from individual frames using a CNN and video features from the resulting time series of spatial features using time series modeling have shown high performance. However, there remains a paucity of research on training methods for CNNs employed for feature extraction or representation learning in surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). Our proposed method involves fine-tuning the image encoder of a CLIP (Contrastive Language-Image Pre-training) vision-language model using prompt learning for surgical phase recognition. The experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method in comparison to conventional methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_13746
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model Kondo, Satoshi Computer Vision and Pattern Recognition Image and Video Processing Surgical phase recognition from video is a technology that automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition technology have focused primarily on Transformer-based methods, although methods that extract spatial features from individual frames using a CNN and video features from the resulting time series of spatial features using time series modeling have shown high performance. However, there remains a paucity of research on training methods for CNNs employed for feature extraction or representation learning in surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). Our proposed method involves fine-tuning the image encoder of a CLIP (Contrastive Language-Image Pre-training) vision-language model using prompt learning for surgical phase recognition. The experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method in comparison to conventional methods.
title	ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model
topic	Computer Vision and Pattern Recognition; Image and Video Processing
url	https://arxiv.org/abs/2505.13746

Similar Items