Library Catalog

Bibliographic Details
Main Authors:	Li, Ruiqi; Zhang, Yu; Pan, Changhao; Lei, Ke; Yin, Xiang; Yang, Cheng
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2605.30993

Similar Items

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
by: Zhang, Yu, et al.
Published: (2024)

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
by: Jiang, Yuepeng, et al.
Published: (2024)

REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers
by: Jiang, Yuepeng, et al.
Published: (2025)

Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling
by: Yang, Yuguang, et al.
Published: (2024)

Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference
by: Dai, Shuqi, et al.
Published: (2025)

Generative Expressive Conversational Speech Synthesis
by: Liu, Rui, et al.
Published: (2024)

SF-Speech: Straightened Flow for Zero-Shot Voice Clone
by: Li, Xuyuan, et al.
Published: (2024)

Debatts: Zero-Shot Debating Text-to-Speech Synthesis
by: Huang, Yiqiao, et al.
Published: (2024)

StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching
by: Yao, Jixun, et al.
Published: (2024)

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
by: Zhang, Yu, et al.
Published: (2025)

Text-aware and Context-aware Expressive Audiobook Speech Synthesis
by: Guo, Dake, et al.
Published: (2024)

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)

MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2025)

UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models
by: Tu, Wenming, et al.
Published: (2025)

Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion
by: Akti, Seymanur, et al.
Published: (2025)

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
by: Lei, Ke, et al.
Published: (2026)

Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches
by: Pan, Changhao, et al.
Published: (2026)

Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2024)

ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
by: Zhu, Han, et al.
Published: (2025)

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing
by: Anastassiou, Philip, et al.
Published: (2024)

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning
by: Zhu, Xinfa, et al.
Published: (2023)

FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech
by: Ma, Linhan, et al.
Published: (2025)

StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2024)

Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching
by: Pan, Yu, et al.
Published: (2024)

Robust Singing Voice Transcription Serves Synthesis
by: Li, Ruiqi, et al.
Published: (2024)

MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech
by: Bak, Taejun, et al.
Published: (2024)

Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech
by: Avdeeva, Anastasia, et al.
Published: (2024)

Zero-Shot Duet Singing Voices Separation with Diffusion Models
by: Yu, Chin-Yun, et al.
Published: (2023)

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)

ZSDEVC: Zero-Shot Diffusion-based Emotional Voice Conversion with Disentangled Mechanism
by: Chou, Hsing-Hang, et al.
Published: (2024)

HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
by: Nishimura, Yuto, et al.
Published: (2024)

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
by: Yang, Qian, et al.
Published: (2024)

RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding
by: Liu, Yisi, et al.
Published: (2025)

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
by: Zhu, Han, et al.
Published: (2026)

CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition
by: Wang, Jianzong, et al.
Published: (2024)

SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis
by: Qian, Jiale, et al.
Published: (2026)

Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio
by: Shi, Mohan, et al.
Published: (2025)

Zero-Shot Mono-to-Binaural Speech Synthesis
by: Levkovitch, Alon, et al.
Published: (2024)

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
by: Zhang, Bowen, et al.
Published: (2025)

Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
by: Li, Ruiqi, et al.
Published: (2024)