Staff View: TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Dang, Trung; Rao, Sharath; Gupta, Ananya; Gagne, Christopher; Tzirakis, Panagiotis; Baird, Alice; Cłapa, Jakub Piotr; Chin, Peter; Cowen, Alan
Format:	Preprint
Published:	2026
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2602.23068

_version_	1866915819974295552
author	Dang, Trung; Rao, Sharath; Gupta, Ananya; Gagne, Christopher; Tzirakis, Panagiotis; Baird, Alice; Cłapa, Jakub Piotr; Chin, Peter; Cowen, Alan
author_facet	Dang, Trung; Rao, Sharath; Gupta, Ananya; Gagne, Christopher; Tzirakis, Panagiotis; Baird, Alice; Cłapa, Jakub Piotr; Chin, Peter; Cowen, Alan
contents	Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle speech modality within the context enables text-only guidance, a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_23068
institution	arXiv
publishDate	2026
record_format	arxiv
title	TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment
topic	Sound
url	https://arxiv.org/abs/2602.23068