Staff View: Scale, Don't Fine-tune: Guiding Multimodal LLMs for Efficient Visual Place Recognition at Test-Time

Bibliographic Details
Main Authors:	Cheng, Jintao; Li, Weibin; Luo, Jiehao; Tang, Xiaoyu; He, Zhijian; Wu, Jin; Zou, Yao; Zhang, Wei
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.02129

_version_	1866908515314958336
author	Cheng, Jintao; Li, Weibin; Luo, Jiehao; Tang, Xiaoyu; He, Zhijian; Wu, Jin; Zou, Yao; Zhang, Wei
author_facet	Cheng, Jintao; Li, Weibin; Luo, Jiehao; Tang, Xiaoyu; He, Zhijian; Wu, Jin; Zou, Yao; Zhang, Wei
contents	Visual Place Recognition (VPR) has evolved from handcrafted descriptors to deep learning approaches, yet significant challenges remain. Current approaches, including Vision Foundation Models (VFMs) and Multimodal Large Language Models (MLLMs), enhance semantic understanding but suffer from high computational overhead and limited cross-domain transferability when fine-tuned. To address these limitations, we propose a novel zero-shot framework employing Test-Time Scaling (TTS) that leverages MLLMs' vision-language alignment capabilities through Guidance-based methods for direct similarity scoring. Our approach eliminates two-stage processing by employing structured prompts that generate length-controllable JSON outputs. The TTS framework with Uncertainty-Aware Self-Consistency (UASC) enables real-time adaptation without additional training costs, achieving superior generalization across diverse environments. Experimental results demonstrate significant improvements in cross-domain VPR performance with up to 210× computational efficiency gains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_02129
institution	arXiv
publishDate	2025
record_format	arxiv
title	Scale, Don't Fine-tune: Guiding Multimodal LLMs for Efficient Visual Place Recognition at Test-Time
topic	Machine Learning; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2509.02129
