| Main Authors: | Han, Changjin, Lee, Seokgi, Nam, Gyuhyeon, Chae, Gyeongsu |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2409.09311 |
Similar Items
GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor
by: Lee, Seokgi, et al.
Published: (2025)
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
by: Lei, Shun, et al.
Published: (2023)
HiFi-Glot: High-Fidelity Neural Formant Synthesis with Differentiable Resonant Filters
by: Gu, Yicheng, et al.
Published: (2024)
Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQ
by: Chae, Yunkee, et al.
Published: (2025)
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
by: Li, Yinghao Aaron, et al.
Published: (2024)
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis
by: Wang, Helin, et al.
Published: (2024)
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
by: Kim, Jaehyeon, et al.
Published: (2024)
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
by: Jiang, Yuepeng, et al.
Published: (2024)
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
by: Han, Bing, et al.
Published: (2024)
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
by: Hu, Yuchen, et al.
Published: (2024)
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
by: Li, Yinghao Aaron, et al.
Published: (2024)
Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference
by: Dai, Shuqi, et al.
Published: (2025)
Zero-Shot Mono-to-Binaural Speech Synthesis
by: Levkovitch, Alon, et al.
Published: (2024)
CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models
by: Chen, Junyang, et al.
Published: (2026)
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)
DIFFRENT: A Diffusion Model for Recording Environment Transfer of Speech
by: Im, Jaekwon, et al.
Published: (2024)
SF-Speech: Straightened Flow for Zero-Shot Voice Clone
by: Li, Xuyuan, et al.
Published: (2024)
Zero-Shot Text-to-Speech from Continuous Text Streams
by: Dang, Trung, et al.
Published: (2024)
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
by: Zhu, Han, et al.
Published: (2025)
Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation
by: Yang, Mu, et al.
Published: (2024)
Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
by: Nespoli, Francesco, et al.
Published: (2024)
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
by: Nishimura, Yuto, et al.
Published: (2024)
Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS
by: Seo, Deokjin, et al.
Published: (2026)
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
by: Ji, Shengpeng, et al.
Published: (2024)
Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
by: Zhang, Leying, et al.
Published: (2025)
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
by: Zhang, Bowen, et al.
Published: (2025)
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2025)
Parallel Synthesis for Autoregressive Speech Generation
by: Hsu, Po-chun, et al.
Published: (2022)
Instance-Specific Test-Time Training for Speech Editing in the Wild
by: Kim, Taewoo, et al.
Published: (2025)
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
by: Wang, Chunhui, et al.
Published: (2024)
Improving Rare-Word Recognition of Whisper in Zero-Shot Settings
by: Jogi, Yash, et al.
Published: (2025)
Zero-Shot Duet Singing Voices Separation with Diffusion Models
by: Yu, Chin-Yun, et al.
Published: (2023)
Zero-Shot Recognition of Dysarthric Speech Using Commercial Automatic Speech Recognition and Multimodal Large Language Models
by: Alsayegh, Ali, et al.
Published: (2025)
Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition
by: Zhu, Han, et al.
Published: (2024)
Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages
by: Ranjan, Rishabh, et al.
Published: (2025)
Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement
by: Han, Seungu, et al.
Published: (2025)
Zero-Shot Text-to-Speech for Vietnamese
by: Vu, Thi, et al.
Published: (2025)
MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction
by: Chae, Yunkee, et al.
Published: (2025)