
Bibliographic Details
Main Authors:	Luo, Xiaoxue; Huang, Jinwei; Yang, Runyan; Gao, Yingying; Feng, Junlan; Deng, Chao; Zhang, Shilei
Format:	Preprint
Published:	2025
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2509.09201

_version_	1866911149298024448
author	Luo, Xiaoxue; Huang, Jinwei; Yang, Runyan; Gao, Yingying; Feng, Junlan; Deng, Chao; Zhang, Shilei
author_facet	Luo, Xiaoxue; Huang, Jinwei; Yang, Runyan; Gao, Yingying; Feng, Junlan; Deng, Chao; Zhang, Shilei
contents	Universal audio codecs learn entangled representations across audio types, whereas some specific codecs offer decoupled representations but are limited to speech. Real-world audio, however, often contains mixed speech and background sounds, and downstream tasks require selective access to these components. Therefore, we rethink the audio codec as a universal disentangled representation learner to enable controllable feature selection across different audio tasks. To this end, we introduce DeCodec, a novel neural codec that learns to decouple audio representations into orthogonal subspaces dedicated to speech and background sound; within speech, representations are further decomposed into semantic and paralinguistic components. This hierarchical disentanglement allows flexible feature selection, making DeCodec a universal front-end for multiple audio applications. Technically, built upon a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces correspond to speech and background sound, respectively. Together, these allow parallel RVQs to quantize the speech and background sound components independently. Furthermore, we apply semantic guidance to the speech RVQ to achieve semantic and paralinguistic decomposition. Experimental results show that DeCodec maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement and effective one-shot voice conversion on noisy speech via representation recombination, improved ASR robustness through clean semantic features, and controllable background sound preservation/suppression in TTS. Demo Page: https://luo404.github.io/DeCodecV2/
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_09201
institution	arXiv
publishDate	2025
record_format	arxiv
title	DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners
topic	Sound
url	https://arxiv.org/abs/2509.09201

Similar Items