Staff View: Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Gu, Tiancheng; Yang, Kaicheng; Feng, Ziyong; Wang, Xingjun; Zhang, Yanzhao; Long, Dingkun; Chen, Yingda; Cai, Weidong; Deng, Jiankang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2504.17432

_version_	1866912752714383360
author	Gu, Tiancheng; Yang, Kaicheng; Feng, Ziyong; Wang, Xingjun; Zhang, Yanzhao; Long, Dingkun; Chen, Yingda; Cai, Weidong; Deng, Jiankang
author_facet	Gu, Tiancheng; Yang, Kaicheng; Feng, Ziyong; Wang, Xingjun; Zhang, Yanzhao; Long, Dingkun; Chen, Yingda; Cai, Weidong; Deng, Jiankang
contents	The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_17432
institution	arXiv
publishDate	2025
record_format	arxiv
title	Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2504.17432

Similar Items