Staff View: SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs :: Library Catalog

Bibliographic Details
Main Authors:	Lou, Haoran; Liu, Ziyan; Fan, Chunxiao; Wu, Yuexin; Ming, Yue; Wu, Hao; Zuo, Kai; Chen, Yibo; Tang, Xu
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.13710

_version_	1866913105918820352
author	Lou, Haoran; Liu, Ziyan; Fan, Chunxiao; Wu, Yuexin; Ming, Yue; Wu, Hao; Zuo, Kai; Chen, Yibo; Tang, Xu
author_facet	Lou, Haoran; Liu, Ziyan; Fan, Chunxiao; Wu, Yuexin; Ming, Yue; Wu, Hao; Zuo, Kai; Chen, Yibo; Tang, Xu
contents	Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench, validating that preserving the pre-trained representations via non-invasive adaptation is an effective strategy for MLLM-based retrieval. The code is available at https://github.com/CnFaker/SLQ.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_13710
institution	arXiv
publishDate	2026
record_format	arxiv
title	SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.13710

Similar Items