Bibliographic Details
Main Authors: Wang, Taowen, Liu, Yiyang, Liang, James Chenhao, Zhao, Junhan, Cui, Yiming, Mao, Yuning, Nie, Shaoliang, Liu, Jiahao, Feng, Fuli, Xu, Zenglin, Han, Cheng, Huang, Lifu, Wang, Qifan, Liu, Dongfang
Format: Preprint
Published: 2024
Subjects: Artificial Intelligence, Computation and Language, Machine Learning
Online Access: https://arxiv.org/abs/2409.15657
author Wang, Taowen
Liu, Yiyang
Liang, James Chenhao
Zhao, Junhan
Cui, Yiming
Mao, Yuning
Nie, Shaoliang
Liu, Jiahao
Feng, Fuli
Xu, Zenglin
Han, Cheng
Huang, Lifu
Wang, Qifan
Liu, Dongfang
contents Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs. M$^2$PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.
format Preprint
id arxiv_https___arxiv_org_abs_2409_15657
institution arXiv
publishDate 2024
record_format arxiv
title M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning
topic Artificial Intelligence
Computation and Language
Machine Learning
url https://arxiv.org/abs/2409.15657