Bibliographic Details
Main Authors: Li, Xueqing; Ma, Hao; Li, Zehan; Chen, Rujin; Zhu, Boyu; Jing, Ruihao; Kang, Jian; Li, Jie; Zhang, Chi; Zhang, Xiao-Lei; Li, Xuelong
Format: Preprint
Published: 2025
Subjects: Audio and Speech Processing
Online Access: https://arxiv.org/abs/2504.04721
author Li, Xueqing
Ma, Hao
Li, Zehan
Chen, Rujin
Zhu, Boyu
Jing, Ruihao
Kang, Jian
Li, Jie
Zhang, Chi
Zhang, Xiao-Lei
Li, Xuelong
contents Self-supervised learning (SSL) has become a core technique in speech processing, but the high dimensionality of its representations makes discretization essential for improving efficiency. However, existing discretization methods still suffer from significant information loss, resulting in a notable performance gap compared to continuous representations. To overcome these limitations, we propose two quantization-based discretization methods: Product Quantization (PQ) and Random Product Quantization (RPQ). PQ partitions the original feature space into multiple subspaces and independently quantizes each sub-vector, producing a fused set of discrete units that retain diverse information from different subspaces, thereby mitigating the loss associated with single-cluster quantization. RPQ further enhances representation diversity by randomly sampling a fixed proportion of feature dimensions multiple times to construct sub-vectors, thereby better capturing the variability in the data distribution. Theoretical analysis shows that RPQ reduces the correlation coefficient $\rho$ (where $0 \leq \rho \leq 1$) between sub-quantizers. Its quantization error is lower-bounded by the product $\rho \, \epsilon_{kms}$, where $\epsilon_{kms}$ denotes the quantization error of a single K-means quantizer. Experimental results on a combined dataset built from LibriSpeech and ML-SUPERB show that PQ and RPQ outperform standard K-means discretization, achieving relative improvements of 21.8% and 20.0% in WER on LibriSpeech, and 24.1% and 19.6% in CER on ML-SUPERB, respectively. Moreover, their performance is competitive with, and in some cases even surpasses, that of continuous SSL representations.
format Preprint
id arxiv_https___arxiv_org_abs_2504_04721
institution arXiv
publishDate 2025
record_format arxiv
title Bridging the Gap between Continuous and Informative Discrete Representations by Random Product Quantization
topic Audio and Speech Processing
url https://arxiv.org/abs/2504.04721
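
The two discretization schemes summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`kmeans`, `pq_dims`, `rpq_dims`, `discretize`), the contiguous split used for PQ, sampling dimensions without replacement within each RPQ sub-vector, and all sizes (32 dimensions, 4 sub-quantizers, 16 centroids, proportion 0.5) are assumptions chosen for illustration. The random matrix stands in for real SSL features, and the abstract does not specify how the resulting units are fused downstream, so the sketch simply returns one unit index per sub-quantizer.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()

    def assign(c):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return centroids, assign(centroids)


def pq_dims(dim, num_sub):
    """PQ (assumed layout): contiguous, disjoint groups of dimensions."""
    return np.array_split(np.arange(dim), num_sub)


def rpq_dims(dim, num_sub, proportion, seed=0):
    """RPQ: sample a fixed proportion of dimensions at random, num_sub times."""
    rng = np.random.default_rng(seed)
    size = max(1, int(round(proportion * dim)))
    # Assumption: no repeated dimension inside a single sub-vector.
    return [np.sort(rng.choice(dim, size=size, replace=False))
            for _ in range(num_sub)]


def discretize(x, dim_groups, k, seed=0):
    """Quantize each sub-vector independently; one unit index per group."""
    units = [kmeans(x[:, idx], k, seed=seed + i)[1]
             for i, idx in enumerate(dim_groups)]
    return np.stack(units, axis=1)  # shape: (num_frames, num_sub)


# Stand-in for SSL frame features: 500 frames, 32 dimensions.
features = np.random.default_rng(0).normal(size=(500, 32))
pq_units = discretize(features, pq_dims(32, 4), k=16)
rpq_units = discretize(features, rpq_dims(32, 4, proportion=0.5), k=16)
```

Each row of `pq_units` or `rpq_units` holds the discrete units for one frame, one column per sub-quantizer. In PQ the groups are disjoint and cover every dimension exactly once; in RPQ the groups may overlap because each is drawn independently.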