Staff View: FreeRet: MLLMs as Training-Free Retrievers

Bibliographic Details
Main Authors:	Zhu, Yuhan; Zeng, Xiangyu; Wang, Chenting; Li, Xinhao; Liu, Chunxu; Xu, Yicheng; Yan, Ziang; Wang, Yi; Wang, Limin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.24621

_version_	1866914598477627392
author	Zhu, Yuhan; Zeng, Xiangyu; Wang, Chenting; Li, Xinhao; Liu, Chunxu; Xu, Yicheng; Yan, Ziang; Wang, Yi; Wang, Limin
author_facet	Zhu, Yuhan; Zeng, Xiangyu; Wang, Chenting; Li, Xinhao; Liu, Chunxu; Xu, Yicheng; Yan, Ziang; Wang, Yi; Wang, Limin
contents	Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating the framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_24621
institution	arXiv
publishDate	2025
record_format	arxiv
title	FreeRet: MLLMs as Training-Free Retrievers
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2509.24621
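
The two-stage pattern described in the contents field (fast candidate search over embeddings, then precise reranking) can be illustrated with a minimal sketch. This is not FreeRet's implementation: the function names, the toy vectors, and the `rerank_score` callback are hypothetical placeholders. In the described framework the embeddings and the reranking judgment would both come from the MLLM itself; here they are stubbed with fixed numbers.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_search(query_vec: Sequence[float],
                  cand_vecs: Sequence[Sequence[float]],
                  top_k: int) -> List[int]:
    """Stage 1: rank all candidates by embedding similarity, keep top_k indices."""
    order = sorted(range(len(cand_vecs)),
                   key=lambda i: cosine(query_vec, cand_vecs[i]),
                   reverse=True)
    return order[:top_k]


def two_stage_retrieve(query_vec: Sequence[float],
                       cand_vecs: Sequence[Sequence[float]],
                       rerank_score: Callable[[int], float],
                       top_k: int) -> List[int]:
    """Stage 2: reorder the stage-1 shortlist with a costlier scorer.

    rerank_score is a hypothetical stand-in for a model-based relevance
    judgment; it maps a candidate index to a score (higher is better).
    """
    shortlist = coarse_search(query_vec, cand_vecs, top_k)
    return sorted(shortlist, key=rerank_score, reverse=True)


# Toy data (made up for illustration only).
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.8, 0.6], [0.0, 1.0], [1.0, 0.05]]
stub_scores = {0: 0.9, 1: 0.7, 2: 0.1, 3: 0.4}  # hypothetical reranker outputs

shortlist = coarse_search(query, candidates, top_k=2)
final = two_stage_retrieve(query, candidates, stub_scores.get, top_k=2)
print(shortlist, final)
```

The design point is that the cheap similarity pass touches every candidate while the expensive scorer only sees the `top_k` shortlist, so the reranker can change the order within the shortlist but cannot recover an item the first stage dropped.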
