:: Library Catalog

Bibliographic Details
Main Authors:	Xing, Jingyuan; Yang, Mingru; Li, Zhipeng; Xing, Xiaofen; Xu, Xiangmin
Format:	Preprint
Published:	2025
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2510.11646

Similar Items

Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech
by: Xing, Jingyuan, et al.
Published: (2025)

Long-Context Speech Synthesis with Context-Aware Memory
by: Li, Zhipeng, et al.
Published: (2025)

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control
by: Mai, Jialong, et al.
Published: (2026)

MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech
by: Mai, Jialong, et al.
Published: (2025)

Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition
by: Chen, Weidong, et al.
Published: (2023)

HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
by: Nie, Sihang, et al.
Published: (2025)

S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
by: Fang, Yuanbo, et al.
Published: (2025)

LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
by: Dang, Trung, et al.
Published: (2024)

AS-Speech: Adaptive Style For Speech Synthesis
by: Li, Zhipeng, et al.
Published: (2024)

Zero-Shot Text-to-Speech for Vietnamese
by: Vu, Thi, et al.
Published: (2025)

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)

FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
by: Zhang, Bowen, et al.
Published: (2025)

Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
by: Hu, Yuchen, et al.
Published: (2024)

DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec
by: Li, Tao, et al.
Published: (2025)

Zero-Shot Text-to-Speech from Continuous Text Streams
by: Dang, Trung, et al.
Published: (2024)

Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
by: Liao, Shijia, et al.
Published: (2024)

Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
by: Nespoli, Francesco, et al.
Published: (2024)

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
by: Liu, Zhijun, et al.
Published: (2024)

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
by: Jiang, Yuepeng, et al.
Published: (2024)

Zero-Shot Mono-to-Binaural Speech Synthesis
by: Levkovitch, Alon, et al.
Published: (2024)

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
by: Ji, Shengpeng, et al.
Published: (2024)

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
by: Peng, Puyuan, et al.
Published: (2024)

Towards Zero-Shot Text-To-Speech for Arabic Dialects
by: Doan, Khai Duy, et al.
Published: (2024)

Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
by: Zhang, Leying, et al.
Published: (2025)

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
by: Ju, Zeqian, et al.
Published: (2024)

Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
by: Lin, Zijian, et al.
Published: (2025)

Parallel Synthesis for Autoregressive Speech Generation
by: Hsu, Po-chun, et al.
Published: (2022)

Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
by: Chen, Zehua, et al.
Published: (2023)

MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora
by: Feng, Tao, et al.
Published: (2026)

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
by: Lei, Shun, et al.
Published: (2023)

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)

CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models
by: Chen, Junyang, et al.
Published: (2026)

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
by: Han, Bing, et al.
Published: (2024)

SF-Speech: Straightened Flow for Zero-Shot Voice Clone
by: Li, Xuyuan, et al.
Published: (2024)

ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
by: Zhu, Han, et al.
Published: (2025)

HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
by: Nishimura, Yuto, et al.
Published: (2024)

XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
by: Casanova, Edresson, et al.
Published: (2024)

StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling
by: Wang, Hui, et al.
Published: (2025)

Autoregressive Speech Synthesis without Vector Quantization
by: Meng, Lingwei, et al.
Published: (2024)