Staff View: On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts :: Library Catalog

Bibliographic Details
Main Authors:	Gulzar, Kashaf; Wagner, Dominik; Bayerl, Sebastian P.; Hönig, Florian; Bocklet, Tobias; Riedhammer, Korbinian
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2512.02027

_version_	1866911296826376192
author	Gulzar, Kashaf; Wagner, Dominik; Bayerl, Sebastian P.; Hönig, Florian; Bocklet, Tobias; Riedhammer, Korbinian
author_facet	Gulzar, Kashaf; Wagner, Dominik; Bayerl, Sebastian P.; Hönig, Florian; Bocklet, Tobias; Riedhammer, Korbinian
contents	Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value. We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions, evaluated on simulated (LibriStutter, English) and natural (KSoF, German) stuttered speech datasets. To mitigate ASR performance disparities and bias towards English, we introduce a multi-step fine-tuning strategy with language-adaptive pretraining. Tokenization analysis further highlights the tokenizer's English-centric bias, which poses challenges for improving performance on German data. Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR while exposing key limitations in multilingual E2E systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_02027
institution	arXiv
publishDate	2025
record_format	arxiv
title	On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts
topic	Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2512.02027
