MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Huang, Ruizhe; Zhang, Xiaohui; Ni, Zhaoheng; Sun, Li; Hira, Moto; Hwang, Jeff; Manohar, Vimal; Pratap, Vineel; Wiesner, Matthew; Watanabe, Shinji; Povey, Daniel; Khudanpur, Sanjeev
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2406.02560

_version_	1866910534817808384
author	Huang, Ruizhe; Zhang, Xiaohui; Ni, Zhaoheng; Sun, Li; Hira, Moto; Hwang, Jeff; Manohar, Vimal; Pratap, Vineel; Wiesner, Matthew; Watanabe, Shinji; Povey, Daniel; Khudanpur, Sanjeev
author_facet	Huang, Ruizhe; Zhang, Xiaohui; Ni, Zhaoheng; Sun, Li; Hira, Moto; Hwang, Jeff; Manohar, Vimal; Pratap, Vineel; Wiesner, Matthew; Watanabe, Shinji; Povey, Daniel; Khudanpur, Sanjeev
contents	Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., at the phoneme level. This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to predict the offset of the tokens more accurately, in addition to their onset. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC's token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit, Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_02560
institution	arXiv
publishDate	2024
record_format	arxiv
title	Less Peaky and More Accurate CTC Forced Alignment by Label Priors
topic	Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2406.02560
