Staff View: Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models :: Library Catalog

Bibliographic Details
Main Authors:	Vilouras, Konstantinos; Sanchez, Pedro; O'Neil, Alison Q.; Tsaftaris, Sotirios A.
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2404.12920

_version_	1866915128556912640
author	Vilouras, Konstantinos; Sanchez, Pedro; O'Neil, Alison Q.; Tsaftaris, Sotirios A.
author_facet	Vilouras, Konstantinos; Sanchez, Pedro; O'Neil, Alison Q.; Tsaftaris, Sotirios A.
contents	Localizing the exact pathological regions in a given medical scan is an important imaging problem that traditionally requires a large amount of bounding box ground truth annotations to be accurately solved. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to perform this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains cross-attention mechanisms that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any training on the target task, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and also refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art approaches which explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms them on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance at https://github.com/vios-s.
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_12920
institution	arXiv
publishDate	2024
record_format	arxiv
title	Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2404.12920