Affichage MARC: :: Library Catalog

Enregistré dans:

Détails bibliographiques
Auteurs principaux:	Lin, Liting, Fan, Heng, Zhang, Zhipeng, Wang, Yaowei, Xu, Yong, Ling, Haibin
Format:	Preprint
Publié:	2024
Sujets:	Computer Vision and Pattern Recognition
Accès en ligne:	https://arxiv.org/abs/2403.05231
Tags:	Ajouter un tag Pas de tags, Soyez le premier à ajouter un tag!

_version_	1866913446355795968
author	Lin, Liting Fan, Heng Zhang, Zhipeng Wang, Yaowei Xu, Yong Ling, Haibin
author_facet	Lin, Liting Fan, Heng Zhang, Zhipeng Wang, Yaowei Xu, Yong Ling, Haibin
contents	Motivated by the Parameter-Efficient Fine-Tuning (PEFT) in large language models, we propose LoRAT, a method that unveils the power of large ViT model for tracking within laboratory-level resources. The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency, to the domain of visual tracking. However, unique challenges and potential domain gaps make this transfer not as easy as the first intuition. Firstly, a transformer-based tracker constructs unshared position embedding for template and search image. This poses a challenge for the transfer of LoRA, usually requiring consistency in the design when applied to the pre-trained backbone, to downstream tasks. Secondly, the inductive bias inherent in convolutional heads diminishes the effectiveness of parameter-efficient fine-tuning in tracking models. To overcome these limitations, we first decouple the position embeddings in transformer-based trackers into shared spatial ones and independent type ones. The shared embeddings, which describe the absolute coordinates of multi-resolution images (namely, the template and search images), are inherited from the pre-trained backbones. In contrast, the independent embeddings indicate the sources of each token and are learned from scratch. Furthermore, we design an anchor-free head solely based on MLP to adapt PETR, enabling better performance with less computational overhead. With our design, 1) it becomes practical to train trackers with the ViT-g backbone on GPUs with only memory of 25.8GB (batch size of 16); 2) we reduce the training time of the L-224 variant from 35.0 to 10.8 GPU hours; 3) we improve the LaSOT SUC score from 0.703 to 0.742 with the L-224 variant; 4) we fast the inference speed of the L-224 variant from 52 to 119 FPS. Code and models are available at https://github.com/LitingLin/LoRAT.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_05231
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance Lin, Liting Fan, Heng Zhang, Zhipeng Wang, Yaowei Xu, Yong Ling, Haibin Computer Vision and Pattern Recognition Motivated by the Parameter-Efficient Fine-Tuning (PEFT) in large language models, we propose LoRAT, a method that unveils the power of large ViT model for tracking within laboratory-level resources. The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency, to the domain of visual tracking. However, unique challenges and potential domain gaps make this transfer not as easy as the first intuition. Firstly, a transformer-based tracker constructs unshared position embedding for template and search image. This poses a challenge for the transfer of LoRA, usually requiring consistency in the design when applied to the pre-trained backbone, to downstream tasks. Secondly, the inductive bias inherent in convolutional heads diminishes the effectiveness of parameter-efficient fine-tuning in tracking models. To overcome these limitations, we first decouple the position embeddings in transformer-based trackers into shared spatial ones and independent type ones. The shared embeddings, which describe the absolute coordinates of multi-resolution images (namely, the template and search images), are inherited from the pre-trained backbones. In contrast, the independent embeddings indicate the sources of each token and are learned from scratch. Furthermore, we design an anchor-free head solely based on MLP to adapt PETR, enabling better performance with less computational overhead. With our design, 1) it becomes practical to train trackers with the ViT-g backbone on GPUs with only memory of 25.8GB (batch size of 16); 2) we reduce the training time of the L-224 variant from 35.0 to 10.8 GPU hours; 3) we improve the LaSOT SUC score from 0.703 to 0.742 with the L-224 variant; 4) we fast the inference speed of the L-224 variant from 52 to 119 FPS. Code and models are available at https://github.com/LitingLin/LoRAT.
title	Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2403.05231

Documents similaires