Staff View: Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks :: Library Catalog


Bibliographic Details
Main Authors:	Kim, Sungkyung; Lee, Adam; Park, Junyoung; Chung, Andrew; Oh, Jusang; Lee, Jay-Yoon
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2410.09489

_version_	1866929538688090112
author	Kim, Sungkyung; Lee, Adam; Park, Junyoung; Chung, Andrew; Oh, Jusang; Lee, Jay-Yoon
author_facet	Kim, Sungkyung; Lee, Adam; Park, Junyoung; Chung, Andrew; Oh, Jusang; Lee, Jay-Yoon
contents	Recent advancements in large language models have demonstrated enhanced capabilities in visual reasoning tasks by employing additional encoders for aligning different modalities. While the Q-Former has been widely used as a general encoder for aligning several modalities, including image, video, audio, and 3D, with large language models, prior work on its efficient training and on the analysis of its individual components has been limited. In this work, we investigate the effectiveness of parameter-efficient fine-tuning (PEFT) of the Q-Former using InstructBLIP on the visual reasoning benchmarks ScienceQA and IconQA. We observe that applying PEFT to the Q-Former achieves performance comparable to full fine-tuning while using under 2% of the trainable parameters. Additionally, we employ AdaLoRA for dynamic parameter budget reallocation to examine the relative importance of the Q-Former's sublayers across four benchmarks. Our findings reveal that the self-attention layers are noticeably more important in perceptual visual-language reasoning tasks, and that the relative importance of the FFN layers depends on the complexity of the visual-language patterns involved in the tasks. The code is available at https://github.com/AttentionX/InstructBLIP_PEFT.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_09489
institution	arXiv
publishDate	2024
record_format	arxiv
title	Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks
topic	Computation and Language
url	https://arxiv.org/abs/2410.09489

Similar Items