Staff View: Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval :: Library Catalog

Bibliographic Details
Main Authors:	Li, Haiwen; Liu, Delong; Hou, Zhaohui; Zhao, Zhicheng; Su, Fei
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.05970

_version_	1866908589526876160
author	Li, Haiwen; Liu, Delong; Hou, Zhaohui; Zhao, Zhicheng; Su, Fei
author_facet	Li, Haiwen; Liu, Delong; Hou, Zhaohui; Zhao, Zhicheng; Su, Fei
contents	As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_05970
institution	arXiv
publishDate	2025
record_format	arxiv
title	Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2507.05970