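| Main Authors: | Hu, Ming; Yuan, Kun; Shen, Yaling; Tang, Feilong; Xu, Xiaohao; Zhou, Lin; Li, Wei; Chen, Ying; Xu, Zhongxing; Peng, Zelin; Yan, Siyuan; Srivastav, Vinkle; Song, Diping; Li, Tianbin; Shi, Danli; Ye, Jin; Padoy, Nicolas; Navab, Nassir; He, Junjun; Ge, Zongyuan |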
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2411.15421 |
| _version_ | 1866917849509920768 |
|---|---|
| author | Hu, Ming; Yuan, Kun; Shen, Yaling; Tang, Feilong; Xu, Xiaohao; Zhou, Lin; Li, Wei; Chen, Ying; Xu, Zhongxing; Peng, Zelin; Yan, Siyuan; Srivastav, Vinkle; Song, Diping; Li, Tianbin; Shi, Danli; Ye, Jin; Padoy, Nicolas; Navab, Nassir; He, Junjun; Ge, Zongyuan |
| contents | Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge. This complexity, together with the limited availability of annotated data, makes surgical vision-language pretraining (VLP) particularly challenging. To address this gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework designed specifically for ophthalmic surgical workflow understanding. OphCLIP leverages OphVL, a dataset we constructed: a large-scale collection of over 375K hierarchically structured video-text pairs covering tens of thousands of attribute combinations (surgeries, phases/operations/actions, instruments, and medications, as well as more advanced aspects such as the causes of eye diseases, surgical objectives, and postoperative recovery recommendations). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations: aligning short video clips with detailed narrative descriptions captures intricate surgical details, while aligning full videos with structured titles captures high-level procedural insights. OphCLIP also introduces a retrieval-augmented pretraining framework that leverages underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance representation learning on narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP's robust generalization and superior performance. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2411_15421 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2411.15421 |
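
The contents field describes two levels of alignment: short clips with narrative descriptions, and full videos with structured titles. As a rough illustration only, the sketch below shows what such a two-level objective could look like if each level used a CLIP-style symmetric contrastive (InfoNCE) loss. This record does not give the paper's actual loss, so the function names, the temperature of 0.07, the equal weighting of the two levels, and the use of plain Python lists for embeddings are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy over rows, where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(visual, text, temperature=0.07):
    """CLIP-style loss: pair i of `visual` is assumed to match pair i of `text`."""
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    # Similarity matrix: rows index visual items, columns index text items.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_transposed))


def hierarchical_loss(clip_emb, narration_emb, video_emb, title_emb, weight=0.5):
    """Hypothetical combination of clip-level and video-level alignment terms."""
    clip_level = symmetric_info_nce(clip_emb, narration_emb)
    video_level = symmetric_info_nce(video_emb, title_emb)
    return weight * clip_level + (1.0 - weight) * video_level
```

With toy two-dimensional embeddings, correctly matched pairs such as `symmetric_info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])` yield a much smaller loss than the same embeddings with the text order swapped, which is the behavior a contrastive alignment objective is meant to have. The retrieval-augmented component mentioned in the abstract is not covered by this sketch.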