| Main Authors: | Lou, Haoran; Liu, Ziyan; Fan, Chunxiao; Wu, Yuexin; Ming, Yue; Wu, Hao; Zuo, Kai; Chen, Yibo; Tang, Xu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2604.13710 |
| _version_ | 1866913105918820352 |
|---|---|
| author | Lou, Haoran; Liu, Ziyan; Fan, Chunxiao; Wu, Yuexin; Ming, Yue; Wu, Hao; Zuo, Kai; Chen, Yibo; Tang, Xu |
| author_facet | Lou, Haoran; Liu, Ziyan; Fan, Chunxiao; Wu, Yuexin; Ming, Yue; Wu, Hao; Zuo, Kai; Chen, Yibo; Tang, Xu |
| contents | Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench, validating that preserving the pre-trained representations via non-invasive adaptation is an effective strategy for MLLM-based retrieval. The code is available at https://github.com/CnFaker/SLQ. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_13710 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2604.13710 |
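
The abstract describes appending a small set of Shared Latent Queries to both text and image tokens, so that causal attention in a frozen backbone aggregates each input into one embedding space. The sketch below is a toy illustration of that idea only, not the authors' implementation (see the linked repository for that). The single-head attention with identity projections standing in for the frozen MLLM, the mean pooling over query positions, the dimensions, and all function names are assumptions made for illustration; training of the queries is omitted because the record does not describe the loss.

```python
import math
import random


def causal_self_attention(seq):
    """Toy stand-in for a frozen backbone: one causal attention head
    with identity projections (assumption, no trainable weights)."""
    dim = len(seq[0])
    out = []
    for i, query in enumerate(seq):
        # Causal mask: position i attends only to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(query, seq[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * seq[j][d] for j, w in enumerate(weights))
            for d in range(dim)
        ])
    return out


def embed(tokens, latent_queries):
    """Append the shared queries after the input tokens, run the frozen
    backbone, and mean-pool the outputs at the query positions."""
    hidden = causal_self_attention(tokens + latent_queries)
    query_out = hidden[len(tokens):]
    dim = len(query_out[0])
    pooled = [sum(v[d] for v in query_out) / len(query_out) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]  # unit length, so dot product = cosine


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # hypothetical sizes

# The only parameters that would be trained; the same list serves both modalities.
latent_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Random placeholders for text and image token states.
text_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
image_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(9)]

text_emb = embed(text_tokens, latent_queries)
image_emb = embed(image_tokens, latent_queries)
similarity = cosine(text_emb, image_emb)
```

Because the queries sit at the end of the sequence, the causal mask lets them see every input token while leaving the hidden states of the input tokens themselves unchanged, which is one way to read the abstract's "non-invasive" claim.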