Bibliographic Details
Main Authors: Yang, Dongchao; Liu, Songxiang; Guo, Haohan; Zhao, Jiankun; Wang, Yuanyuan; Wang, Helin; Ju, Zeqian; Liu, Xubo; Chen, Xueyuan; Tan, Xu; Wu, Xixin; Meng, Helen
Format: Preprint
Published: 2025
Subjects: Sound
Online Access: https://arxiv.org/abs/2504.10344
_version_ 1866916688909303808
author Yang, Dongchao
Liu, Songxiang
Guo, Haohan
Zhao, Jiankun
Wang, Yuanyuan
Wang, Helin
Ju, Zeqian
Liu, Xubo
Chen, Xueyuan
Tan, Xu
Wu, Xixin
Meng, Helen
contents Recent advancements in audio language models have underscored the pivotal role of audio tokenization, which converts audio signals into discrete tokens, thereby facilitating the application of language model architectures to the audio domain. In this study, we introduce ALMTokenizer, a novel low-bitrate and semantically rich audio codec tokenizer for audio language models. Prior methods, such as Encodec, typically encode individual audio frames into discrete tokens without considering the use of context information across frames. Unlike these methods, we introduce a novel query-based compression strategy to capture holistic information with a set of learnable query tokens by explicitly modeling the context information across frames. This design not only enables the codec model to capture more semantic information but also encodes the audio signal with fewer token sequences. Additionally, to enhance the semantic information in audio codec models, we introduce the following: (1) A masked autoencoder (MAE) loss, (2) Vector quantization based on semantic priors, and (3) An autoregressive (AR) prediction loss. As a result, ALMTokenizer achieves competitive reconstruction performance relative to state-of-the-art approaches while operating at a lower bitrate. Within the same audio language model framework, ALMTokenizer outperforms previous tokenizers in audio understanding and generation tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2504_10344
institution arXiv
publishDate 2025
record_format arxiv
title ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
topic Sound
url https://arxiv.org/abs/2504.10344
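As a rough illustration of the query-based compression idea summarized in the contents field above, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the mechanism, so single-head cross-attention is used here as a stand-in: a small set of query vectors attends over each window of frame embeddings, and each window is thereby reduced to fewer vectors before nearest-neighbor vector quantization. The function names (`query_compress`, `quantize`, `bitrate_bps`), the window size, the dimensions, and the random "learnable" queries are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_compress(frames, queries, window):
    """Summarize each window of frames with a few query vectors.

    frames:  (T, D) frame embeddings (assumed to come from some encoder)
    queries: (Q, D) query tokens (random here; learned in a real model)
    window:  number of frames each group of queries attends over
    Returns an array of shape ((T // window) * Q, D).
    """
    T, D = frames.shape
    n_windows = T // window  # trailing frames that do not fill a window are dropped
    out = []
    for w in range(n_windows):
        chunk = frames[w * window:(w + 1) * window]   # (window, D)
        scores = queries @ chunk.T / np.sqrt(D)       # (Q, window)
        attn = softmax(scores, axis=-1)               # each row sums to 1
        out.append(attn @ chunk)                      # (Q, D) context summary
    return np.concatenate(out, axis=0)


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (L2)."""
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def bitrate_bps(tokens_per_second, codebook_size):
    """Bits per second for a single codebook: tokens/s * log2(codebook size)."""
    return tokens_per_second * np.log2(codebook_size)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(50, 16))    # 50 frames, 16-dim (illustrative)
    queries = rng.normal(size=(1, 16))    # 1 query token per window
    codebook = rng.normal(size=(32, 16))  # 32-entry codebook (illustrative)

    compressed = query_compress(frames, queries, window=5)
    token_ids = quantize(compressed, codebook)
    print(compressed.shape, token_ids.shape)
```

In this sketch, 50 frames with one query per 5-frame window yield 10 tokens instead of 50, which is the sense in which a query-based scheme can lower the token rate. The MAE loss, semantic-prior quantization, and AR prediction loss named in the abstract are not modeled here.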