| Main Authors: | Sun, Haiyang; Hu, Shujie; Liu, Shujie; Meng, Lingwei; Wang, Hui; Han, Bing; Yang, Yifan; Liu, Yanqing; Zhao, Sheng; Lu, Yan; Qian, Yanmin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.19669 |
Similar Items
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
by: Han, Bing, et al.
Published: (2024)
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2025)
StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling
by: Wang, Hui, et al.
Published: (2025)
Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)
Autoregressive Speech Synthesis without Vector Quantization
by: Meng, Lingwei, et al.
Published: (2024)
FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
by: Wang, Hui, et al.
Published: (2025)
Advanced Long-Content Speech Recognition With Factorized Neural Transducer
by: Gong, Xun, et al.
Published: (2024)
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
by: Chen, Sanyuan, et al.
Published: (2024)
A Unified Neural Codec Language Model for Selective Editable Text to Speech Generation
by: Pei, Hanchen, et al.
Published: (2026)
Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
by: Yuan, Ze, et al.
Published: (2024)
Position: Towards Responsible Evaluation for Text-to-Speech
by: Yang, Yifan, et al.
Published: (2025)
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
by: Zhang, Leying, et al.
Published: (2024)
Next Tokens Denoising for Speech Synthesis
by: Liu, Yanqing, et al.
Published: (2025)
Boosting Large Language Model for Speech Synthesis: An Empirical Study
by: Hao, Hongkun, et al.
Published: (2023)
WavLLM: Towards Robust and Adaptive Speech Large Language Model
by: Hu, Shujie, et al.
Published: (2024)
Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2024)
Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
by: Zhang, Leying, et al.
Published: (2025)
DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice
by: Zhang, Leying, et al.
Published: (2026)
DDTSE: Discriminative Diffusion Model for Target Speech Extraction
by: Zhang, Leying, et al.
Published: (2023)
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
by: Meng, Lingwei, et al.
Published: (2024)
Exploring Cross-Utterance Speech Contexts for Conformer-Transducer Speech Recognition Systems
by: Cui, Mingyu, et al.
Published: (2025)
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
by: Le, Chenyang, et al.
Published: (2024)
ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation
by: Li, Zongyi, et al.
Published: (2024)
Risk Factors of Thrombocytopenia After Cardiac Surgery with Cardiopulmonary Bypass
by: Shujie Yan
Published: (2023)
Transfer Learning for High Dimensional Robust Regression
by: Yuan, Xiaohui, et al.
Published: (2024)
AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM
by: Fan, Ruchao, et al.
Published: (2024)
Zero-Shot Text-to-Speech from Continuous Text Streams
by: Dang, Trung, et al.
Published: (2024)
Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition
by: Zhong, Tao, et al.
Published: (2025)
Closing the Modality Reasoning Gap for Speech Large Language Models
by: Wang, Chaoren, et al.
Published: (2026)
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
by: Xin, Detai, et al.
Published: (2024)
FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
by: Wang, Hui, et al.
Published: (2025)
Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness
by: Yu, Lu, et al.
Published: (2026)
Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text
by: Xu, Hainan, et al.
Published: (2026)
Anti-Disturbance Hierarchical Sliding Mode Controller for Deep-Sea Cranes with Adaptive Control and Neural Network Compensation
by: Zuo, Qian, et al.
Published: (2025)
IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
by: Zhou, Siyi, et al.
Published: (2025)
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
by: Ju, Zeqian, et al.
Published: (2024)
EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning
by: Wang, Dingdong, et al.
Published: (2026)
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
by: Hu, Yuchen, et al.
Published: (2024)
High-Winding-Number Zero-Energy Edge States in Rhombohedral-Stacked Su-Schrieffer-Heeger Multilayers
by: Lu, Feng, et al.
Published: (2025)