Bibliographic Details
Main Authors: Sun, Yingfei; Gu, Xu; Ji, Wei; Zhao, Hanbin; Yin, Yifang; Zimmermann, Roger
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2503.04258
author Sun, Yingfei
Gu, Xu
Ji, Wei
Zhao, Hanbin
Yin, Yifang
Zimmermann, Roger
contents Many studies combine text and audio to capture multi-modal information, but they overlook how well the model generalizes to new datasets. Introducing a new dataset can alter the feature space learned on the original dataset, leading to catastrophic forgetting. Meanwhile, a large number of model parameters can significantly affect training performance. To address these limitations, we introduce a novel task, Text-Audio Incremental Learning (TAIL), for text-audio retrieval, and propose a new method, PTAT (Prompt Tuning for Audio-Text incremental learning). PTAT uses prompt tuning to optimize the model parameters and incorporates an audio-text similarity and feature distillation module to mitigate catastrophic forgetting. We benchmark our method and previous incremental learning methods on the AudioCaps, Clotho, BBC Sound Effects, and AudioSet datasets. Our method significantly outperforms previous methods and shows particularly strong resistance to forgetting on older datasets. Compared to the full-parameter Finetune (Sequential) method, our model requires only 2.42% of its parameters while achieving 4.46% higher performance.
format Preprint
id arxiv_https___arxiv_org_abs_2503_04258
institution arXiv
publishDate 2025
record_format arxiv
title TAIL: Text-Audio Incremental Learning
topic Sound
Artificial Intelligence
Computer Vision and Pattern Recognition
Audio and Speech Processing
I.2
url https://arxiv.org/abs/2503.04258