Bibliographic Details
Main Authors: Poon, Crystal Min Hui; Ng, Pai Chet; Miao, Xiaoxiao; Loh, Immanuel Jun Kai; Zhang, Bowen; Song, Haoyu; Mcloughlin, Ian
Format: Preprint
Published: 2025
Subjects: Sound; Computation and Language
Online Access: https://arxiv.org/abs/2511.11104
author Poon, Crystal Min Hui
Ng, Pai Chet
Miao, Xiaoxiao
Loh, Immanuel Jun Kai
Zhang, Bowen
Song, Haoyu
Mcloughlin, Ian
contents Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist in reducing perceived quality: accent bias, where models default towards dominant phonetic patterns, and linguistic bias, a misalignment in dialect-specific lexical or cultural information. These biases are interdependent and authentic accent generation requires both accent fidelity and correctly localized text. We present CLARITY (Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis), a backbone-agnostic framework to address both biases through dual-signal optimization. Firstly, we apply contextual linguistic adaptation to localize input text to align with the target dialect. Secondly, we propose retrieval-augmented accent prompting (RAAP) to ensure accent-consistent speech prompts. We evaluate CLARITY on twelve varieties of English accent via both subjective and objective analysis. Results clearly indicate that CLARITY improves accent accuracy and fairness, ensuring higher perceptual quality output. Code and audio samples are available at https://github.com/ICT-SIT/CLARITY.
format Preprint
id arxiv_https___arxiv_org_abs_2511_11104
institution arXiv
publishDate 2025
record_format arxiv
title CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation
topic Sound
Computation and Language
url https://arxiv.org/abs/2511.11104