Bibliographic Details
Main Authors: Hu, Ming; Yuan, Kun; Shen, Yaling; Tang, Feilong; Xu, Xiaohao; Zhou, Lin; Li, Wei; Chen, Ying; Xu, Zhongxing; Peng, Zelin; Yan, Siyuan; Srivastav, Vinkle; Song, Diping; Li, Tianbin; Shi, Danli; Ye, Jin; Padoy, Nicolas; Navab, Nassir; He, Junjun; Ge, Zongyuan
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2411.15421
_version_ 1866917849509920768
author Hu, Ming
Yuan, Kun
Shen, Yaling
Tang, Feilong
Xu, Xiaohao
Zhou, Lin
Li, Wei
Chen, Ying
Xu, Zhongxing
Peng, Zelin
Yan, Siyuan
Srivastav, Vinkle
Song, Diping
Li, Tianbin
Shi, Danli
Ye, Jin
Padoy, Nicolas
Navab, Nassir
He, Junjun
Ge, Zongyuan
contents Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge. This complexity, together with the limited availability of annotated data, makes surgical vision-language pretraining (VLP) particularly challenging. To address this gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework designed for ophthalmic surgical workflow understanding. OphCLIP leverages OphVL, a dataset we constructed of over 375K hierarchically structured video-text pairs covering tens of thousands of attribute combinations (surgeries, phases/operations/actions, instruments, and medications, as well as more advanced aspects such as the causes of eye diseases, surgical objectives, and postoperative recovery recommendations). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations: aligning short video clips with detailed narrative descriptions captures intricate surgical details, while aligning full videos with structured titles captures high-level procedural insights. OphCLIP also incorporates a retrieval-augmented pretraining framework that leverages large-scale, underexplored silent surgical procedure videos by automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP's robust generalization and superior performance.
format Preprint
id arxiv_https___arxiv_org_abs_2411_15421
institution arXiv
publishDate 2024
record_format arxiv
title OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2411.15421