Staff View: Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning? :: Library Catalog

Bibliographic Details
Main Authors:	Han, Cheng; Wang, Qifan; Cui, Yiming; Wang, Wenguan; Huang, Lifu; Qi, Siyuan; Liu, Dongfang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2401.12902

_version_	1866913205829238784
author	Han, Cheng; Wang, Qifan; Cui, Yiming; Wang, Wenguan; Huang, Lifu; Qi, Siyuan; Liu, Dongfang
author_facet	Han, Cheng; Wang, Qifan; Cui, Yiming; Wang, Wenguan; Huang, Lifu; Qi, Siyuan; Liu, Dongfang
contents	As the scale of vision models continues to grow, the emergence of Visual Prompt Tuning (VPT) as a parameter-efficient transfer learning technique has gained attention due to its superior performance compared to traditional full finetuning. However, the conditions favoring VPT (the "when") and the underlying rationale (the "why") remain unclear. In this paper, we conduct a comprehensive analysis across 19 distinct datasets and tasks. To understand the "when" aspect, we identify the scenarios where VPT proves favorable along two dimensions: task objectives and data distributions. We find that VPT is preferable when there is 1) a substantial disparity between the original and the downstream task objectives (e.g., transitioning from classification to counting), or 2) a similarity in data distributions between the two tasks (e.g., both involve natural images). In exploring the "why" dimension, our results indicate VPT's success cannot be attributed solely to overfitting and optimization considerations. The unique way VPT preserves original features and adds parameters appears to be a pivotal factor. Our study provides insights into VPT's mechanisms and offers guidance for its optimal utilization.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_12902
institution	arXiv
publishDate	2024
record_format	arxiv
title	Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2401.12902

Similar Items