Staff View: M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Taowen; Liu, Yiyang; Liang, James Chenhao; Zhao, Junhan; Cui, Yiming; Mao, Yuning; Nie, Shaoliang; Liu, Jiahao; Feng, Fuli; Xu, Zenglin; Han, Cheng; Huang, Lifu; Wang, Qifan; Liu, Dongfang
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2409.15657

_version_	1866909371516059648
author	Wang, Taowen; Liu, Yiyang; Liang, James Chenhao; Zhao, Junhan; Cui, Yiming; Mao, Yuning; Nie, Shaoliang; Liu, Jiahao; Feng, Fuli; Xu, Zenglin; Han, Cheng; Huang, Lifu; Wang, Qifan; Liu, Dongfang
author_facet	Wang, Taowen; Liu, Yiyang; Liang, James Chenhao; Zhao, Junhan; Cui, Yiming; Mao, Yuning; Nie, Shaoliang; Liu, Jiahao; Feng, Fuli; Xu, Zenglin; Han, Cheng; Huang, Lifu; Wang, Qifan; Liu, Dongfang
contents	Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs. M$^2$PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_15657
institution	arXiv
publishDate	2024
record_format	arxiv
title	M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning
topic	Artificial Intelligence; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2409.15657
