Bibliographic Details
Main Authors: Yeh, Sung-Lin; Bell, Peter; Tang, Hao
Format: Preprint
Published: 2025
Subjects: Audio and Speech Processing; Computation and Language
Online Access: https://arxiv.org/abs/2601.00100
author Yeh, Sung-Lin
Bell, Peter
Tang, Hao
contents Despite being the best known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls the development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.
format Preprint
id arxiv_https___arxiv_org_abs_2601_00100
institution arXiv
publishDate 2025
record_format arxiv
title Learning Speech Representations with Variational Predictive Coding
topic Audio and Speech Processing
Computation and Language
url https://arxiv.org/abs/2601.00100