Bibliographic Details
Main Authors: Li, Zhipeng; Xing, Xiaofen; Wang, Jun; Chen, Shuaiqi; Yu, Guoqiao; Wan, Guanglu; Xu, Xiangmin
Format: Preprint
Published: 2024
Subjects: Audio and Speech Processing
Online Access: https://arxiv.org/abs/2409.05730
_version_ 1866913494179250176
author Li, Zhipeng
Xing, Xiaofen
Wang, Jun
Chen, Shuaiqi
Yu, Guoqiao
Wan, Guanglu
Xu, Xiangmin
contents In recent years, Text-to-Speech (TTS) synthesis technology has made significant progress, enabling high-quality voice synthesis in common scenarios. In unseen scenarios, adaptive TTS must generalize well to speaker style characteristics. However, existing adaptive methods can only extract and integrate coarse-grained timbre or mixed rhythm attributes separately. In this paper, we propose AS-Speech, an adaptive style methodology that integrates speaker timbre characteristics and rhythmic attributes into a unified framework for text-to-speech synthesis. Specifically, AS-Speech can accurately simulate style characteristics through fine-grained text-based timbre features and global rhythm information, and achieves high-fidelity speech synthesis through a diffusion model. Experiments show that the proposed model produces voices with higher naturalness and greater similarity in timbre and rhythm than a series of adaptive TTS models.
format Preprint
id arxiv_https___arxiv_org_abs_2409_05730
institution arXiv
publishDate 2024
record_format arxiv
title AS-Speech: Adaptive Style For Speech Synthesis
topic Audio and Speech Processing
url https://arxiv.org/abs/2409.05730