Staff View: Patch Ranking: Efficient CLIP by Learning to Rank Local Patches :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Cheng-En; Lin, Jinhong; Hu, Yu Hen; Morgado, Pedro
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2409.14607

_version_	1866915038746378240
author	Wu, Cheng-En; Lin, Jinhong; Hu, Yu Hen; Morgado, Pedro
author_facet	Wu, Cheng-En; Lin, Jinhong; Hu, Yu Hen; Morgado, Pedro
contents	Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP and identifying the optimal subset of tokens for maximum performance. To address this, we propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this Ranking. To compensate for any performance degradation resulting from token pruning, we incorporate learnable visual tokens that aid in restoring and potentially enhancing the model's performance. Our work presents a comprehensive and systematic investigation of pruning tokens within the ViT backbone of CLIP models. Through our framework, we successfully reduced 40% of patch tokens in CLIP's ViT while only suffering a minimal average accuracy loss of 0.3 across seven datasets. Our study lays the groundwork for building more computationally efficient multimodal models without sacrificing their performance, addressing a key challenge in the application of advanced vision-language models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_14607
institution	arXiv
publishDate	2024
record_format	arxiv
title	Patch Ranking: Efficient CLIP by Learning to Rank Local Patches
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2409.14607

Similar Items