Bibliographic Details
Main Authors: Barreto, Jesimon; Caetano, Carlos; Araujo, André; Schwartz, William Robson
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2510.20994
_version_ 1866908609176141824
author Barreto, Jesimon
Caetano, Carlos
Araujo, André
Schwartz, William Robson
author_facet Barreto, Jesimon
Caetano, Carlos
Araujo, André
Schwartz, William Robson
contents Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques; otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions without the need for annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
format Preprint
id arxiv_https___arxiv_org_abs_2510_20994
institution arXiv
publishDate 2025
record_format arxiv
title VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Machine Learning
url https://arxiv.org/abs/2510.20994