Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Barreto, Jesimon, Caetano, Carlos, Araujo, André, Schwartz, William Robson
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition Artificial Intelligence Machine Learning
Online Access:	https://arxiv.org/abs/2510.20994
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866908609176141824
author	Barreto, Jesimon Caetano, Carlos Araujo, André Schwartz, William Robson
author_facet	Barreto, Jesimon Caetano, Carlos Araujo, André Schwartz, William Robson
contents	Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques - otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions, without the need of annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_20994
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models Barreto, Jesimon Caetano, Carlos Araujo, André Schwartz, William Robson Computer Vision and Pattern Recognition Artificial Intelligence Machine Learning Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques - otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions, without the need of annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
title	VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models
topic	Computer Vision and Pattern Recognition Artificial Intelligence Machine Learning
url	https://arxiv.org/abs/2510.20994

Similar Items