Bibliographic Details
Main Authors: Lou, Haoran; Liu, Ziyan; Fan, Chunxiao; Wu, Yuexin; Ming, Yue; Wu, Hao; Zuo, Kai; Chen, Yibo; Tang, Xu
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2604.13710
author Lou, Haoran
Liu, Ziyan
Fan, Chunxiao
Wu, Yuexin
Ming, Yue
Wu, Hao
Zuo, Kai
Chen, Yibo
Tang, Xu
contents Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. These results validate that preserving the pre-trained representations via non-invasive adaptation is an effective strategy for MLLM-based retrieval. The code is available at https://github.com/CnFaker/SLQ.
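Illustrative note (not part of the catalogued record): the mechanism described in the abstract, appending one shared set of query vectors to either a text or an image token sequence and reading the embedding from the query positions under causal attention, can be sketched roughly as follows. This is a toy under stated assumptions, not the authors' implementation: the single attention layer with identity projections stands in for a frozen MLLM backbone, the mean pooling over query positions is an assumption the abstract does not confirm, the queries are random rather than trained, and all names (`causal_attention`, `embed`, `shared_queries`) are hypothetical.

```python
import math
import random


def causal_attention(xs):
    """Toy single-head causal self-attention with identity projections.

    Stands in for a frozen backbone: it has no trainable parameters here.
    """
    d = len(xs[0])
    out = []
    for i, q in enumerate(xs):
        # Position i may only attend to positions 0..i (causal mask).
        scores = [
            sum(a * b for a, b in zip(q, xs[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(
            [sum(w * xs[j][k] for j, w in enumerate(weights)) / z for k in range(d)]
        )
    return out


def embed(tokens, latent_queries):
    """Append the shared queries, run the frozen layer, pool the query positions."""
    hidden = causal_attention(tokens + latent_queries)
    query_states = hidden[len(tokens):]  # only the appended positions
    d = len(tokens[0])
    # Assumption: mean-pool the query states, then L2-normalise.
    pooled = [sum(h[k] for h in query_states) / len(query_states) for k in range(d)]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


def cosine(u, v):
    """Dot product of two unit-norm vectors, i.e. their cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4

# In the described method these would be the only trainable parameters;
# here they are random placeholders shared by both modalities.
shared_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Random stand-ins for token states of a caption and of an image.
text_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
image_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(9)]

text_emb = embed(text_tokens, shared_queries)
image_emb = embed(image_tokens, shared_queries)
similarity = cosine(text_emb, image_emb)
```

Because the queries sit at the end of the sequence, the causal mask lets them attend to every preceding token while leaving the hidden states of the original tokens unchanged, which is one way to read the abstract's description of the adaptation as non-invasive.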
format Preprint
id arxiv_https___arxiv_org_abs_2604_13710
institution arXiv
publishDate 2026
record_format arxiv
title SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2604.13710