Bibliographic Details
Main Authors: Yeh, Sung-Lin, Tang, Hao
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2409.06109
_version_ 1866917817641598976
author Yeh, Sung-Lin
Tang, Hao
author_facet Yeh, Sung-Lin
Tang, Hao
contents Representing speech with discrete units has been widely used in speech codecs and speech generation. However, there are several unverified claims about self-supervised discrete units, such as that k-means disentangles phonetic and speaker information, or that information is lost after k-means. In this work, we take an information-theoretic perspective to answer how much information is present (information completeness) and how much information is accessible (information accessibility), before and after residual vector quantization. We show a lower bound for information completeness and estimate completeness on discretized HuBERT representations after residual vector quantization. We find that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve disentanglement. Our results offer a comprehensive assessment of the choice of discrete units, and suggest that much more of the information in the residual should be mined rather than discarded.
format Preprint
id arxiv_https___arxiv_org_abs_2409_06109
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Estimating the Completeness of Discrete Speech Units
Yeh, Sung-Lin
Tang, Hao
Audio and Speech Processing
Computation and Language
title Estimating the Completeness of Discrete Speech Units
topic Audio and Speech Processing
Computation and Language
url https://arxiv.org/abs/2409.06109