Staff View: Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval :: Library Catalog

Bibliographic Details
Main Authors:	Cheng, Jiacheng; Shin, Hijung Valentina; Vasconcelos, Nuno; Russell, Bryan; Heilbron, Fabian Caba
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2405.03190

_version_	1866916236508528640
author	Cheng, Jiacheng; Shin, Hijung Valentina; Vasconcelos, Nuno; Russell, Bryan; Heilbron, Fabian Caba
author_facet	Cheng, Jiacheng; Shin, Hijung Valentina; Vasconcelos, Nuno; Russell, Bryan; Heilbron, Fabian Caba
contents	In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_03190
institution	arXiv
publishDate	2024
record_format	arxiv
title	Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2405.03190

Similar Items