Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Gu, Geonmo; Heo, Byeongho; Yu, Jaemyung; Hwang, Jaehui; Kim, Taekyung; Lee, Sangmin; Jun, HeeJae; Kang, Yoohoon; Yun, Sangdoo; Han, Dongyoon
Format:	Preprint
Published:	2026
Subjects:	Information Retrieval
Online Access:	https://arxiv.org/abs/2602.06393

_version_	1866917379784572928
author	Gu, Geonmo; Heo, Byeongho; Yu, Jaemyung; Hwang, Jaehui; Kim, Taekyung; Lee, Sangmin; Jun, HeeJae; Kang, Yoohoon; Yun, Sangdoo; Han, Dongyoon
author_facet	Gu, Geonmo; Heo, Byeongho; Yu, Jaemyung; Hwang, Jaehui; Kim, Taekyung; Lee, Sangmin; Jun, HeeJae; Kang, Yoohoon; Yun, Sangdoo; Han, Dongyoon
contents	Universal multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet, despite their empirical success, these models are primarily built on a "single-turn" formulation in which each query-target pair is treated as an independent data point. This paradigm is computationally inefficient at scale, as it requires a separate forward pass for each pair and overlooks potential contextual relationships among multiple queries that relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple related query-target pairs associated with a single image within a single forward pass. This allows us to extract multiple query and target embeddings simultaneously, conditioned on a shared context representation, which increases the effective batch size and overall training efficiency. Experiments show that MuCo, combined with a newly curated 5M multimodal multi-turn dataset (M3T), achieves state-of-the-art retrieval performance on the MMEB and M-BEIR benchmarks while markedly improving both training efficiency and representation coherence across modalities. Code and M3T are available at https://github.com/naver-ai/muco
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_06393
institution	arXiv
publishDate	2026
record_format	arxiv
title	MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
topic	Information Retrieval
url	https://arxiv.org/abs/2602.06393

Similar Items