:: Library Catalog

Bibliographic Details
Main Authors:	Zheng, John; Maleki, Farhad
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Artificial Intelligence; Sound
Online Access:	https://arxiv.org/abs/2509.19668

Similar Items

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
by: Du, Zhihao, et al.
Published: (2024)

R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion
by: Zheng, Junjie, et al.
Published: (2025)

DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
by: Li, Xiquan, et al.
Published: (2024)

Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision
by: Jia, Zhijun, et al.
Published: (2024)

Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications
by: Vecino, Biel Tura, et al.
Published: (2025)

EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion
by: Joglekar, Advait, et al.
Published: (2025)

Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures
by: Riou, Alain, et al.
Published: (2024)

Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
by: Chen, Qianniu, et al.
Published: (2025)

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
by: Chen, Sijing, et al.
Published: (2024)

Ensemble of classifiers for speech evaluation
by: Belokrylov, G., et al.
Published: (2024)

OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
by: Huynh-Nguyen, Hieu-Nghia, et al.
Published: (2025)

End-to-end multi-channel speaker extraction and binaural speech synthesis
by: Chi, Cheng, et al.
Published: (2024)

AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection
by: Gong, Rong, et al.
Published: (2024)

Heterogeneous bimodal attention fusion for speech emotion recognition
by: Luo, Jiachen, et al.
Published: (2025)

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis
by: Chen, Zhiyong, et al.
Published: (2024)

MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech
by: Bak, Taejun, et al.
Published: (2024)

Automated evaluation of children's speech fluency for low-resource languages
by: Zhang, Bowen, et al.
Published: (2025)

Enhancing CTC-based speech recognition with diverse modeling units
by: Han, Shiyi, et al.
Published: (2024)

FINALLY: fast and universal speech enhancement with studio-like quality
by: Babaev, Nicholas, et al.
Published: (2024)

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
by: Deng, Wei, et al.
Published: (2025)

Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech
by: Kim, Taesoo, et al.
Published: (2025)

Online neural fusion of distortionless differential beamformers for robust speech enhancement
by: Qian, Yuanhang, et al.
Published: (2025)

A correlation-permutation approach for speech-music encoders model merging
by: Ritter-Gutierrez, Fabian, et al.
Published: (2025)

SPMamba: State-space model is all you need in speech separation
by: Li, Kai, et al.
Published: (2024)

MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt
by: Wu, Zhichao, et al.
Published: (2025)

CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment
by: Liu, Hanwen, et al.
Published: (2026)

EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector
by: Cho, Deok-Hyeon, et al.
Published: (2024)

Learning discriminative features from spectrograms using center loss for speech emotion recognition
by: Dai, Dongyang, et al.
Published: (2025)

Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions
by: Gao, Xiaoxue, et al.
Published: (2025)

SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios
by: Li, Kai, et al.
Published: (2024)

Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
by: Hussain, Shehzeen, et al.
Published: (2025)

Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations
by: Yadav, Sarthak, et al.
Published: (2024)

Erasing Your Voice Before It's Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech
by: Lee, Myungjin, et al.
Published: (2026)

A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model
by: Zhao, Dongdi, et al.
Published: (2024)

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
by: Zhang, Yixiao, et al.
Published: (2024)

Aligning Text-to-Music Evaluation with Human Preferences
by: Huang, Yichen, et al.
Published: (2025)

Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
by: Zhang, Yaoyun, et al.
Published: (2024)

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
by: Wang, Yuancheng, et al.
Published: (2024)

Warm Chat: Diffuse Emotion-aware Interactive Talking Head Avatar with Tree-Structured Guidance
by: Yang, Haijie, et al.
Published: (2025)

Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition
by: Shi, Hao, et al.
Published: (2024)