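Staff View: OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining :: Library Catalog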

Bibliographic Details
Main Authors:	Hu, Ming; Yuan, Kun; Shen, Yaling; Tang, Feilong; Xu, Xiaohao; Zhou, Lin; Li, Wei; Chen, Ying; Xu, Zhongxing; Peng, Zelin; Yan, Siyuan; Srivastav, Vinkle; Song, Diping; Li, Tianbin; Shi, Danli; Ye, Jin; Padoy, Nicolas; Navab, Nassir; He, Junjun; Ge, Zongyuan
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2411.15421

_version_	1866917849509920768
author	Hu, Ming; Yuan, Kun; Shen, Yaling; Tang, Feilong; Xu, Xiaohao; Zhou, Lin; Li, Wei; Chen, Ying; Xu, Zhongxing; Peng, Zelin; Yan, Siyuan; Srivastav, Vinkle; Song, Diping; Li, Tianbin; Shi, Danli; Ye, Jin; Padoy, Nicolas; Navab, Nassir; He, Junjun; Ge, Zongyuan
author_facet	Hu, Ming; Yuan, Kun; Shen, Yaling; Tang, Feilong; Xu, Xiaohao; Zhou, Lin; Li, Wei; Chen, Ying; Xu, Zhongxing; Peng, Zelin; Yan, Siyuan; Srivastav, Vinkle; Song, Diping; Li, Tianbin; Shi, Danli; Ye, Jin; Padoy, Nicolas; Navab, Nassir; He, Junjun; Ge, Zongyuan
contents	Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, making surgical vision-language pretraining (VLP) particularly challenging due to this complexity and the limited availability of annotated data. To address this gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages the OphVL dataset we constructed, a large-scale and comprehensive collection of over 375K hierarchically structured video-text pairs with tens of thousands of different combinations of attributes (surgeries, phases/operations/actions, instruments, medications, as well as more advanced aspects such as the causes of eye diseases, surgical objectives, and postoperative recovery recommendations). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. OphCLIP also incorporates a retrieval-augmented pretraining framework that leverages underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP's robust generalization and superior performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_15421
institution	arXiv
publishDate	2024
record_format	arxiv
title	OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2411.15421
