:: Library Catalog

Bibliographic Details
Main Authors:	Chen, Dekun, Zhang, Xueyao, Wang, Yuancheng, Dai, Kenan, Ma, Li, Wu, Zhizheng
Format:	Preprint
Published:	2026
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2601.04656

Similar Items

MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora
by: Feng, Tao, et al.
Published: (2026)

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
by: Wang, Yuancheng, et al.
Published: (2025)

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models
by: Wang, Yuxiang, et al.
Published: (2026)

Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
by: Zhang, Xueyao, et al.
Published: (2025)

Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation
by: Zhang, Xueyao, et al.
Published: (2025)

Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement
by: Zhang, Xueyao, et al.
Published: (2025)

Voice Impression Control in Zero-Shot TTS
by: Fujita, Kenichi, et al.
Published: (2025)

ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis
by: Li, Haitao, et al.
Published: (2026)

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
by: Wang, Yuancheng, et al.
Published: (2024)

VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
by: Peng, Puyuan, et al.
Published: (2025)

Noro: Noise-Robust One-shot Voice Conversion with Hidden Speaker Representation Learning
by: He, Haorui, et al.
Published: (2024)

DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation
by: Meng, Ming, et al.
Published: (2025)

Multi-Metric Preference Alignment for Generative Speech Restoration
by: Zhang, Junan, et al.
Published: (2025)

SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
by: Zhang, Xueyao, et al.
Published: (2025)

Metis: A Foundation Speech Generation Model with Masked Generative Pre-training
by: Wang, Yuancheng, et al.
Published: (2025)

The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024
by: Zhou, Shuoyi, et al.
Published: (2024)

AnyAccomp: Generalizable Accompaniment Generation via Quantized Melodic Bottleneck
by: Zhang, Junan, et al.
Published: (2025)

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
by: Zhang, Yu, et al.
Published: (2024)

StyleStream: Real-Time Zero-Shot Voice Style Conversion
by: Liu, Yisi, et al.
Published: (2026)

SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion
by: Xue, Liumeng, et al.
Published: (2024)

SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset
by: Gu, Yicheng, et al.
Published: (2025)

Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion
by: Zhang, Xueyao, et al.
Published: (2023)

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
by: Ao, Junyi, et al.
Published: (2024)

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates
by: Li, Jiaqi, et al.
Published: (2025)

An Extensive Analysis of the Singing Voice Conversion Challenge 2025 Evaluation Results
by: Violeta, Lester Phillip, et al.
Published: (2025)

StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching
by: Yao, Jixun, et al.
Published: (2024)

Zero-shot Cross-lingual Voice Transfer for TTS
by: Biadsy, Fadi, et al.
Published: (2024)

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
by: Li, Yinghao Aaron, et al.
Published: (2024)

GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor
by: Lee, Seokgi, et al.
Published: (2025)

Iterate to Differentiate: Enhancing Discriminability and Reliability in Zero-Shot TTS Evaluation
by: Shen, Shengfan, et al.
Published: (2026)

NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation
by: Ni, Qinke, et al.
Published: (2026)

AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis
by: Luo, Dan, et al.
Published: (2025)

An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder
by: Gu, Yicheng, et al.
Published: (2024)

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
by: Xu, Rixi, et al.
Published: (2026)

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis
by: Chen, Zhiyong, et al.
Published: (2024)

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
by: Eskimez, Sefik Emre, et al.
Published: (2024)

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)

Closing the Modality Reasoning Gap for Speech Large Language Models
by: Wang, Chaoren, et al.
Published: (2026)

Intelli-Z: Toward Intelligible Zero-Shot TTS
by: Jung, Sunghee, et al.
Published: (2024)

Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages
by: Arora, Akshit, et al.
Published: (2024)