Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Feng, Tao; Wang, Yuxiang; Wang, Yuancheng; Zhang, Xueyao; Chen, Dekun; Wang, Chaoren; Guan, Xun; Wu, Zhizheng
Format:	Preprint
Published:	2026
Subjects:	Sound; Computation and Language
Online Access:	https://arxiv.org/abs/2604.11552

_version_	1866908977581785088
author	Feng, Tao; Wang, Yuxiang; Wang, Yuancheng; Zhang, Xueyao; Chen, Dekun; Wang, Chaoren; Guan, Xun; Wu, Zhizheng
author_facet	Feng, Tao; Wang, Yuxiang; Wang, Yuancheng; Zhang, Xueyao; Chen, Dekun; Wang, Chaoren; Guan, Xun; Wu, Zhizheng
contents	Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but target matches the reference's voice characteristics, yet such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as training targets. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling. Building on this data construction approach, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech and apply post-training with preference alignment to mitigate the inherent distributional mismatch when training on synthetic data. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_11552
institution	arXiv
publishDate	2026
record_format	arxiv
title	MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora
topic	Sound; Computation and Language
url	https://arxiv.org/abs/2604.11552

Similar Items