Bibliographic Details
Main Authors: Maharana, Sarthak Kumar, Mehra, Akshay, Ramakrishna, Bhavya, Guo, Yunhui, Su, Guan-Ming
Format: Preprint
Published: 2026
Subjects: Machine Learning, Sound
Online Access: https://arxiv.org/abs/2602.18528
_version_ 1866910028807536640
author Maharana, Sarthak Kumar
Mehra, Akshay
Ramakrishna, Bhavya
Guo, Yunhui
Su, Guan-Ming
contents Audio-visual continual test-time adaptation continually adapts a source audio-visual model at test time to unlabeled, non-stationary domains in which either or both modalities may be distributionally shifted. Such shifts hamper online cross-modal learning and eventually lead to poor accuracy. Although previous works have tackled this problem, we find that state-of-the-art (SOTA) methods suffer from catastrophic forgetting: continual parameter updates at test time cause the model's performance to drop well below that of the source model. In this work, we first show that adapting only the modality fusion layer to a target domain not only improves performance on that domain but can also improve performance on subsequent domains. Building on this strong cross-task transferability of the fusion layer's parameters, we propose $\texttt{AV-CTTA}$, a method that improves the model's test-time performance without access to any source data. Our approach uses a selective parameter retrieval mechanism that dynamically retrieves the best fusion-layer parameters from a buffer using only a small batch of test data. These parameters are then integrated into the model, adapted to the current test distribution, and saved back for future use. Extensive experiments on benchmark datasets with unimodal and bimodal corruptions show that $\texttt{AV-CTTA}$ significantly outperforms existing methods while minimizing catastrophic forgetting.
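The retrieve, adapt, and save-back cycle described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the fusion layer is modeled as a plain linear map, "best" is scored by the lowest mean prediction entropy on the small test batch (the abstract does not state the selection criterion), the adaptation step is a caller-supplied placeholder, and the adapted parameters are appended to the buffer rather than overwriting an entry. The names `mean_entropy` and `retrieve_adapt_store` are hypothetical.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical stand-in for fusion-layer parameters: one weight row per class.
Params = List[List[float]]
Batch = List[List[float]]  # each item is an already-fused feature vector


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_entropy(params: Params, batch: Batch) -> float:
    """Average prediction entropy of a linear layer over a batch.

    Assumption: low entropy is used as an unsupervised proxy for "best";
    the abstract does not specify the actual selection criterion.
    """
    total = 0.0
    for x in batch:
        logits = [sum(w * v for w, v in zip(row, x)) for row in params]
        probs = softmax(logits)
        total += -sum(p * math.log(p + 1e-12) for p in probs)
    return total / len(batch)


def retrieve_adapt_store(
    buffer: List[Params],
    batch: Batch,
    adapt: Callable[[Params, Batch], Params],
) -> Tuple[int, Params]:
    """Pick the lowest-entropy parameters, adapt a copy, append the result."""
    best_idx = min(range(len(buffer)), key=lambda i: mean_entropy(buffer[i], batch))
    candidate = [row[:] for row in buffer[best_idx]]  # copy, keep buffer entry intact
    adapted = adapt(candidate, batch)
    buffer.append(adapted)  # assumption: save back by appending a new entry
    return best_idx, adapted


# Toy usage with two buffered parameter sets and a two-sample batch.
buffer = [
    [[0.0, 0.0], [0.0, 0.0]],  # uninformative: uniform predictions
    [[5.0, 0.0], [0.0, 5.0]],  # confident on this batch
]
batch = [[1.0, 0.0], [0.0, 1.0]]


def toy_adapt(params: Params, _batch: Batch) -> Params:
    """Placeholder for a real test-time update; simply scales the weights."""
    return [[1.1 * w for w in row] for row in params]


best_idx, adapted = retrieve_adapt_store(buffer, batch, toy_adapt)
print(best_idx, len(buffer))
```

In a real system the placeholder `toy_adapt` would be replaced by an actual test-time update of the fusion layer, and the buffer would likely need a size limit or a replacement policy; the abstract specifies neither.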
format Preprint
id arxiv_https___arxiv_org_abs_2602_18528
institution arXiv
publishDate 2026
record_format arxiv
title Audio-Visual Continual Test-Time Adaptation without Forgetting
topic Machine Learning
Sound
url https://arxiv.org/abs/2602.18528