Bibliographic Details
Main Authors: Gulzar, Kashaf; Wagner, Dominik; Bayerl, Sebastian P.; Hönig, Florian; Bocklet, Tobias; Riedhammer, Korbinian
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2512.02027
author Gulzar, Kashaf
Wagner, Dominik
Bayerl, Sebastian P.
Hönig, Florian
Bocklet, Tobias
Riedhammer, Korbinian
contents Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value. We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions, evaluated on simulated (LibriStutter, English) and natural (KSoF, German) stuttered speech datasets. To mitigate ASR performance disparities and bias towards English, we introduce a multi-step fine-tuning strategy with language-adaptive pretraining. Tokenization analysis further highlights the tokenizer's English-centric bias, which poses challenges for improving performance on German data. Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR while exposing key limitations in multilingual E2E systems.
format Preprint
id arxiv_https___arxiv_org_abs_2512_02027
institution arXiv
publishDate 2025
record_format arxiv
title On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts
topic Audio and Speech Processing
Artificial Intelligence
Computation and Language
Machine Learning
url https://arxiv.org/abs/2512.02027