Library Catalog

Bibliographic Details
Main Authors:	Sun, Haiyang, Hu, Shujie, Liu, Shujie, Meng, Lingwei, Wang, Hui, Han, Bing, Yang, Yifan, Liu, Yanqing, Zhao, Sheng, Lu, Yan, Qian, Yanmin
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2505.19669

Similar Items

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
by: Han, Bing, et al.
Published: (2024)

Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2025)

StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling
by: Wang, Hui, et al.
Published: (2025)

Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)

Autoregressive Speech Synthesis without Vector Quantization
by: Meng, Lingwei, et al.
Published: (2024)

FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
by: Wang, Hui, et al.
Published: (2025)

Advanced Long-Content Speech Recognition With Factorized Neural Transducer
by: Gong, Xun, et al.
Published: (2024)

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
by: Chen, Sanyuan, et al.
Published: (2024)

A Unified Neural Codec Language Model for Selective Editable Text to Speech Generation
by: Pei, Hanchen, et al.
Published: (2026)

Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
by: Yuan, Ze, et al.
Published: (2024)

Position: Towards Responsible Evaluation for Text-to-Speech
by: Yang, Yifan, et al.
Published: (2025)

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
by: Zhang, Leying, et al.
Published: (2024)

Next Tokens Denoising for Speech Synthesis
by: Liu, Yanqing, et al.
Published: (2025)

Boosting Large Language Model for Speech Synthesis: An Empirical Study
by: Hao, Hongkun, et al.
Published: (2023)

WavLLM: Towards Robust and Adaptive Speech Large Language Model
by: Hu, Shujie, et al.
Published: (2024)

Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2024)

Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
by: Zhang, Leying, et al.
Published: (2025)

DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice
by: Zhang, Leying, et al.
Published: (2026)

DDTSE: Discriminative Diffusion Model for Target Speech Extraction
by: Zhang, Leying, et al.
Published: (2023)

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
by: Meng, Lingwei, et al.
Published: (2024)

Exploring Cross-Utterance Speech Contexts for Conformer-Transducer Speech Recognition Systems
by: Cui, Mingyu, et al.
Published: (2025)

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
by: Le, Chenyang, et al.
Published: (2024)

ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation
by: Li, Zongyi, et al.
Published: (2024)

Risk Factors of Thrombocytopenia After Cardiac Surgery with Cardiopulmonary Bypass
by: Yan, Shujie
Published: (2023)

Transfer Learning for High Dimensional Robust Regression
by: Yuan, Xiaohui, et al.
Published: (2024)

AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
by: Fan, Ruchao, et al.
Published: (2024)

Zero-Shot Text-to-Speech from Continuous Text Streams
by: Dang, Trung, et al.
Published: (2024)

Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition
by: Zhong, Tao, et al.
Published: (2025)

Closing the Modality Reasoning Gap for Speech Large Language Models
by: Wang, Chaoren, et al.
Published: (2026)

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
by: Xin, Detai, et al.
Published: (2024)

FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
by: Wang, Hui, et al.
Published: (2025)

Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness
by: Yu, Lu, et al.
Published: (2026)

Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text
by: Xu, Hainan, et al.
Published: (2026)

Anti-Disturbance Hierarchical Sliding Mode Controller for Deep-Sea Cranes with Adaptive Control and Neural Network Compensation
by: Zuo, Qian, et al.
Published: (2025)

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
by: Zhou, Siyi, et al.
Published: (2025)

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
by: Ju, Zeqian, et al.
Published: (2024)

EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning
by: Wang, Dingdong, et al.
Published: (2026)

Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
by: Hu, Yuchen, et al.
Published: (2024)

High-Winding-Number Zero-Energy Edge States in Rhombohedral-Stacked Su-Schrieffer-Heeger Multilayers
by: Lu, Feng, et al.
Published: (2025)