Bibliographic Details
Main Authors: Cheng, Jiacheng; Shin, Hijung Valentina; Vasconcelos, Nuno; Russell, Bryan; Heilbron, Fabian Caba
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2405.03190
_version_ 1866916236508528640
author Cheng, Jiacheng
Shin, Hijung Valentina
Vasconcelos, Nuno
Russell, Bryan
Heilbron, Fabian Caba
author_facet Cheng, Jiacheng
Shin, Hijung Valentina
Vasconcelos, Nuno
Russell, Bryan
Heilbron, Fabian Caba
contents In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
format Preprint
id arxiv_https___arxiv_org_abs_2405_03190
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval
Cheng, Jiacheng
Shin, Hijung Valentina
Vasconcelos, Nuno
Russell, Bryan
Heilbron, Fabian Caba
Computer Vision and Pattern Recognition
In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
title Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2405.03190