| Main Authors: | Yu, Xinyao; Sun, Hao; Ling, Zeyu; Niu, Ziwei; Bai, Zhenjia; Qin, Rui; Chen, Yen-Wei; Lin, Lanfen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2507.07415 |
| _version_ | 1866908443162443776 |
|---|---|
| author | Yu, Xinyao; Sun, Hao; Ling, Zeyu; Niu, Ziwei; Bai, Zhenjia; Qin, Rui; Chen, Yen-Wei; Lin, Lanfen |
| author_facet | Yu, Xinyao; Sun, Hao; Ling, Zeyu; Niu, Ziwei; Bai, Zhenjia; Qin, Rui; Chen, Yen-Wei; Lin, Lanfen |
| contents | In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in multimodal tasks such as text-image classification. The growing size of LMMs, however, results in a significant computational cost when fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategies have been studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers and integrate different modalities with similarity-based prompt interaction, enabling sufficient information exchange between modalities. With this approach, our method reduces computational resource consumption and uses fewer trainable parameters (about 1% of the foundation model) than other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2507_07415 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | EPIC: Efficient Prompt Interaction for Text-Image Classification |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2507.07415 |