Staff View: A Retrospect to Multi-prompt Learning across Vision and Language :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Ziliang; Huang, Xin; Guan, Quanlong; Lin, Liang; Luo, Weiqi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2511.00191

_version_	1866917053570482176
author	Chen, Ziliang; Huang, Xin; Guan, Quanlong; Lin, Liang; Luo, Weiqi
author_facet	Chen, Ziliang; Huang, Xin; Guan, Quanlong; Lin, Liang; Luo, Weiqi
contents	The vision community is undergoing unprecedented progress with the emergence of Vision-Language Pretraining Models (VLMs). Prompt learning serves as the holy grail of accessing VLMs, since it enables their fast adaptation to downstream tasks with limited resources. However, existing research revolves around single-prompt paradigms and rarely investigates the technical potential of their multi-prompt learning counterparts. This paper aims to provide a principled retrospect for vision-language multi-prompt learning. We extend the recent constant modality gap phenomenon to learnable prompts and then justify, empirically and theoretically, the superiority of vision-language transfer with multi-prompt augmentation. Based on this observation, we propose Energy-based Multi-prompt Learning (EMPL), which generates multiple prompt embeddings by drawing instances from an energy-based distribution implicitly defined by VLMs. EMPL is therefore not only parameter-efficient but also rigorously leads to a balance between in-domain and out-of-domain open-vocabulary generalization. Comprehensive experiments have been conducted to justify our claims and the excellence of EMPL.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_00191
institution	arXiv
publishDate	2025
record_format	arxiv
title	A Retrospect to Multi-prompt Learning across Vision and Language
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2511.00191

Similar Items