Staff View: Multi-Distillation from Speech and Music Representation Models :: Library Catalog

Bibliographic Details
Main Authors:	Wei, Jui-Chiang; Lin, Yi-Cheng; Ritter-Gutierrez, Fabian; Lee, Hung-yi
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2506.07237

_version_	1866915336349024256
author	Wei, Jui-Chiang; Lin, Yi-Cheng; Ritter-Gutierrez, Fabian; Lee, Hung-yi
author_facet	Wei, Jui-Chiang; Lin, Yi-Cheng; Ritter-Gutierrez, Fabian; Lee, Hung-yi
contents	Real-world audio often mixes speech and music, yet models typically handle only one domain. This paper introduces a multi-teacher distillation framework that unifies speech and music models into a single one while significantly reducing model size. Our approach leverages the strengths of domain-specific teacher models, such as HuBERT for speech and MERT for music, and explores various strategies to balance both domains. Experiments across diverse tasks demonstrate that our model matches the performance of domain-specific models, showing the effectiveness of cross-domain distillation. Additionally, we conduct few-shot learning experiments, highlighting the need for general models in real-world scenarios where labeled data is limited. Our results show that our model not only performs on par with specialized models but also outperforms them in few-shot scenarios, proving that a cross-domain approach is essential and effective for diverse tasks with limited data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_07237
institution	arXiv
publishDate	2025
record_format	arxiv
title	Multi-Distillation from Speech and Music Representation Models
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2506.07237

Similar Items