Staff View: Learning Speech Representations with Variational Predictive Coding :: Library Catalog

Bibliographic Details
Main Authors:	Yeh, Sung-Lin; Bell, Peter; Tang, Hao
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Computation and Language
Online Access:	https://arxiv.org/abs/2601.00100

_version_	1866911349944090624
author	Yeh, Sung-Lin; Bell, Peter; Tang, Hao
author_facet	Yeh, Sung-Lin; Bell, Peter; Tang, Hao
contents	Despite being the best known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls the development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_00100
institution	arXiv
publishDate	2025
record_format	arxiv
title	Learning Speech Representations with Variational Predictive Coding
topic	Audio and Speech Processing; Computation and Language
url	https://arxiv.org/abs/2601.00100