Staff View: Audio-Visual Continual Test-Time Adaptation without Forgetting

Bibliographic Details
Main Authors:	Maharana, Sarthak Kumar; Mehra, Akshay; Ramakrishna, Bhavya; Guo, Yunhui; Su, Guan-Ming
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Sound
Online Access:	https://arxiv.org/abs/2602.18528

_version_	1866910028807536640
author	Maharana, Sarthak Kumar; Mehra, Akshay; Ramakrishna, Bhavya; Guo, Yunhui; Su, Guan-Ming
author_facet	Maharana, Sarthak Kumar; Mehra, Akshay; Ramakrishna, Bhavya; Guo, Yunhui; Su, Guan-Ming
contents	Audio-visual continual test-time adaptation continually adapts a source audio-visual model at test time to unlabeled, non-stationary domains in which either or both modalities can be distributionally shifted. Such shifts hamper online cross-modal learning and eventually lead to poor accuracy. While previous works have tackled this problem, we find that state-of-the-art (SOTA) methods suffer from catastrophic forgetting, where the model's performance drops well below that of the source model due to continual parameter updates at test time. In this work, we first show that adapting only the modality fusion layer to a target domain not only improves performance on that domain but can also enhance performance on subsequent domains. Based on this strong cross-task transferability of the fusion layer's parameters, we propose a method, $\texttt{AV-CTTA}$, that improves the model's test-time performance without access to any source data. Our approach uses a selective parameter retrieval mechanism that dynamically retrieves the best fusion layer parameters from a buffer using only a small batch of test data. These parameters are then integrated into the model, adapted to the current test distribution, and saved back for future use. Extensive experiments on benchmark datasets involving unimodal and bimodal corruptions show that $\texttt{AV-CTTA}$ significantly outperforms existing methods while minimizing catastrophic forgetting.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_18528
institution	arXiv
publishDate	2026
record_format	arxiv
title	Audio-Visual Continual Test-Time Adaptation without Forgetting
topic	Machine Learning; Sound
url	https://arxiv.org/abs/2602.18528
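
The retrieve, adapt, and save-back loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the record does not say how the "best" parameters are chosen or how adaptation is performed, so the sketch assumes mean prediction entropy on the unlabeled batch as the selection score and a single entropy-minimization gradient step as the update. The fusion layer is reduced to a small linear map, and all names (`fuse`, `retrieve`, `adapt`, `step`) are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(weights, features):
    """Toy stand-in for a fusion layer: linear map from joint features to logits."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(weights, batch):
    """Label-free score (an assumption): average prediction entropy on the batch."""
    return sum(entropy(softmax(fuse(weights, x))) for x in batch) / len(batch)


def retrieve(buffer, batch):
    """Index of the buffered parameter set with the lowest mean entropy."""
    return min(range(len(buffer)), key=lambda i: mean_entropy(buffer[i], batch))


def adapt(weights, batch, lr=0.1):
    """One gradient-descent step on mean entropy; returns a new weight matrix."""
    grad = [[0.0] * len(weights[0]) for _ in weights]
    for x in batch:
        probs = softmax(fuse(weights, x))
        h = entropy(probs)
        for k, p in enumerate(probs):
            # Derivative of softmax entropy w.r.t. logit k: -p_k * (log p_k + H)
            dz = -p * (math.log(max(p, 1e-12)) + h)
            for j, xj in enumerate(x):
                grad[k][j] += dz * xj / len(batch)
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grad)]


def step(buffer, batch):
    """Retrieve the best parameters, adapt them, and save the result back."""
    idx = retrieve(buffer, batch)
    adapted = adapt(buffer[idx], batch)
    buffer.append(adapted)
    return idx, adapted


# Two candidate parameter sets: a low-confidence one and a high-confidence one.
buffer = [
    [[0.1, 0.0], [0.0, 0.1]],
    [[2.0, 0.0], [0.0, 2.0]],
]
# A small unlabeled batch of joint (audio, visual) feature vectors.
batch = [[1.0, 0.0], [0.0, 1.0]]

idx, adapted = step(buffer, batch)
```

In this toy setup the buffer only grows. A practical variant would need a policy for bounding its size, which the abstract does not describe.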
