Staff View: EPIC: Efficient Prompt Interaction for Text-Image Classification :: Library Catalog

Bibliographic Details
Main Authors:	Yu, Xinyao; Sun, Hao; Ling, Zeyu; Niu, Ziwei; Bai, Zhenjia; Qin, Rui; Chen, Yen-Wei; Lin, Lanfen
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.07415

_version_	1866908443162443776
author	Yu, Xinyao; Sun, Hao; Ling, Zeyu; Niu, Ziwei; Bai, Zhenjia; Qin, Rui; Chen, Yen-Wei; Lin, Lanfen
author_facet	Yu, Xinyao; Sun, Hao; Ling, Zeyu; Niu, Ziwei; Bai, Zhenjia; Qin, Rui; Chen, Yen-Wei; Lin, Lanfen
contents	In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in multimodal tasks such as text-image classification. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategies have been studied to align modalities more efficiently. In this context, we propose a novel, efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers and integrate different modalities with similarity-based prompt interaction, to enable sufficient information exchange between modalities. With this approach, our method requires fewer computational resources and fewer trainable parameters (about 1% of the foundation model) than other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_07415
institution	arXiv
publishDate	2025
record_format	arxiv
title	EPIC: Efficient Prompt Interaction for Text-Image Classification
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2507.07415
