Bibliographic Details
Main Authors: Huang, Ruizhe, Zhang, Xiaohui, Ni, Zhaoheng, Sun, Li, Hira, Moto, Hwang, Jeff, Manohar, Vimal, Pratap, Vineel, Wiesner, Matthew, Watanabe, Shinji, Povey, Daniel, Khudanpur, Sanjeev
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2406.02560
author Huang, Ruizhe
Zhang, Xiaohui
Ni, Zhaoheng
Sun, Li
Hira, Moto
Hwang, Jeff
Manohar, Vimal
Pratap, Vineel
Wiesner, Matthew
Watanabe, Shinji
Povey, Daniel
Khudanpur, Sanjeev
contents Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and can more accurately predict the offsets of tokens in addition to their onsets. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC's token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit, Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
format Preprint
id arxiv_https___arxiv_org_abs_2406_02560
institution arXiv
publishDate 2024
record_format arxiv
title Less Peaky and More Accurate CTC Forced Alignment by Label Priors
topic Audio and Speech Processing
Artificial Intelligence
Computation and Language
Machine Learning
url https://arxiv.org/abs/2406.02560
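
The label-prior idea summarized in the abstract can be illustrated with a small sketch. It is not the authors' released TorchAudio recipe. It assumes that the priors are the mean per-label posterior over frames (estimated here from the same toy data), and that a scaled log prior is subtracted from each frame's log-posteriors before alignment paths are scored. The function names, the toy posteriors, and the scale `alpha=0.3` are all hypothetical choices made for illustration.

```python
import math


def label_log_priors(log_probs):
    """Log of the mean posterior of each label over all frames (assumed prior estimate)."""
    num_frames = len(log_probs)
    vocab_size = len(log_probs[0])
    priors = [
        sum(math.exp(frame[k]) for frame in log_probs) / num_frames
        for k in range(vocab_size)
    ]
    return [math.log(p) for p in priors]


def apply_label_priors(log_probs, log_priors, alpha=0.3):
    """Subtract alpha * log prior from every frame's scores (scores are not renormalized)."""
    return [
        [lp - alpha * prior for lp, prior in zip(frame, log_priors)]
        for frame in log_probs
    ]


# Toy posteriors: 4 frames, 3 labels, index 0 is the blank and dominates most frames.
posteriors = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.90, 0.05, 0.05],
    [0.80, 0.05, 0.15],
]
log_probs = [[math.log(p) for p in frame] for frame in posteriors]
log_priors = label_log_priors(log_probs)
adjusted = apply_label_priors(log_probs, log_priors, alpha=0.3)

# Margin by which the blank beats label 2 in frame 0, before and after the adjustment.
margin_before = log_probs[0][0] - log_probs[0][2]
margin_after = adjusted[0][0] - adjusted[0][2]
print(f"blank margin before: {margin_before:.3f}, after: {margin_after:.3f}")
```

Because the blank has the largest prior in this toy data, its scores are penalized the most, so the margin by which it beats the other labels shrinks. That is the sense in which paths with fewer blanks gain score relative to blank-heavy ones.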