| Main Authors: | Zheng, John, Maleki, Farhad |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.19668 |
Similar Items
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
by: Du, Zhihao, et al.
Published: (2024)
R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion
by: Zheng, Junjie, et al.
Published: (2025)
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
by: Li, Xiquan, et al.
Published: (2024)
Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision
by: Jia, Zhijun, et al.
Published: (2024)
Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications
by: Vecino, Biel Tura, et al.
Published: (2025)
EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion
by: Joglekar, Advait, et al.
Published: (2025)
Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures
by: Riou, Alain, et al.
Published: (2024)
Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
by: Chen, Qianniu, et al.
Published: (2025)
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
by: Chen, Sijing, et al.
Published: (2024)
Ensemble of classifiers for speech evaluation
by: Belokrylov, G., et al.
Published: (2024)
OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
by: Huynh-Nguyen, Hieu-Nghia, et al.
Published: (2025)
End-to-end multi-channel speaker extraction and binaural speech synthesis
by: Chi, Cheng, et al.
Published: (2024)
AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection
by: Gong, Rong, et al.
Published: (2024)
Heterogeneous bimodal attention fusion for speech emotion recognition
by: Luo, Jiachen, et al.
Published: (2025)
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis
by: Chen, Zhiyong, et al.
Published: (2024)
MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech
by: Bak, Taejun, et al.
Published: (2024)
Automated evaluation of children's speech fluency for low-resource languages
by: Zhang, Bowen, et al.
Published: (2025)
Enhancing CTC-based speech recognition with diverse modeling units
by: Han, Shiyi, et al.
Published: (2024)
FINALLY: fast and universal speech enhancement with studio-like quality
by: Babaev, Nicholas, et al.
Published: (2024)
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
by: Deng, Wei, et al.
Published: (2025)
Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech
by: Kim, Taesoo, et al.
Published: (2025)
Online neural fusion of distortionless differential beamformers for robust speech enhancement
by: Qian, Yuanhang, et al.
Published: (2025)
A correlation-permutation approach for speech-music encoders model merging
by: Ritter-Gutierrez, Fabian, et al.
Published: (2025)
SPMamba: State-space model is all you need in speech separation
by: Li, Kai, et al.
Published: (2024)
MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt
by: Wu, Zhichao, et al.
Published: (2025)
CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment
by: Liu, Hanwen, et al.
Published: (2026)
EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector
by: Cho, Deok-Hyeon, et al.
Published: (2024)
Learning discriminative features from spectrograms using center loss for speech emotion recognition
by: Dai, Dongyang, et al.
Published: (2025)
Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions
by: Gao, Xiaoxue, et al.
Published: (2025)
SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios
by: Li, Kai, et al.
Published: (2024)
Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
by: Hussain, Shehzeen, et al.
Published: (2025)
Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations
by: Yadav, Sarthak, et al.
Published: (2024)
Erasing Your Voice Before It's Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech
by: Lee, Myungjin, et al.
Published: (2026)
A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model
by: Zhao, Dongdi, et al.
Published: (2024)
MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
by: Zhang, Yixiao, et al.
Published: (2024)
Aligning Text-to-Music Evaluation with Human Preferences
by: Huang, Yichen, et al.
Published: (2025)
Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
by: Zhang, Yaoyun, et al.
Published: (2024)
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
by: Wang, Yuancheng, et al.
Published: (2024)
Warm Chat: Diffuse Emotion-aware Interactive Talking Head Avatar with Tree-Structured Guidance
by: Yang, Haijie, et al.
Published: (2025)
Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition
by: Shi, Hao, et al.
Published: (2024)