Bibliographic Details
Main Authors: Zhu, Yuhan, Zeng, Xiangyu, Wang, Chenting, Li, Xinhao, Liu, Chunxu, Xu, Yicheng, Yan, Ziang, Wang, Yi, Wang, Limin
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2509.24621
contents Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating the framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
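The two-stage flow described in the abstract (fast embedding-based candidate search, then a more careful reranking pass) can be illustrated with a minimal sketch. This is not FreeRet's implementation: the record does not give one. The function names `cosine` and `retrieve_then_rerank`, the `top_k` parameter, and the `rerank_score` callback are hypothetical, and plain float vectors stand in for whatever embeddings and reasoning-based scores an MLLM would produce.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query_vec: Vector,
    candidate_vecs: Sequence[Vector],
    rerank_score: Callable[[int], float],  # hypothetical stand-in for a model-based reranker
    top_k: int = 2,
) -> List[int]:
    """Return candidate indices: shortlist by embedding similarity, then reorder by rerank_score."""
    # Stage 1: cheap candidate search over all embeddings.
    by_similarity = sorted(
        range(len(candidate_vecs)),
        key=lambda i: cosine(query_vec, candidate_vecs[i]),
        reverse=True,
    )
    shortlist = by_similarity[:top_k]
    # Stage 2: the more expensive scorer is applied only to the shortlist.
    return sorted(shortlist, key=rerank_score, reverse=True)
```

The design point the sketch shows is that the expensive scorer never sees candidates that stage 1 filtered out, so its cost is bounded by `top_k` rather than by the size of the collection.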
format Preprint
id arxiv_https___arxiv_org_abs_2509_24621
institution arXiv
publishDate 2025
record_format arxiv
title FreeRet: MLLMs as Training-Free Retrievers
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2509.24621