| Main Authors: | Lovelace, Justin, Ray, Soham, Kim, Kwangyoun, Weinberger, Kilian Q., Wu, Felix |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.03717 |
Similar Items
Music Transcription with (Almost) No Supervision
by: Shin, Saebyeol, et al.
Published: (2026)
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
by: Liu, Zhijun, et al.
Published: (2024)
DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
by: Lou, Yuxuan, et al.
Published: (2026)
Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2
by: Mohanty, Suvendu Sekhar
Published: (2026)
DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors
by: Lee, Keon, et al.
Published: (2024)
LatentSpeech: Latent Diffusion for Text-To-Speech Generation
by: Lou, Haowei, et al.
Published: (2024)
FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)
Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling
by: Belardi, Christian, et al.
Published: (2026)
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
by: Novack, Zachary, et al.
Published: (2026)
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
by: Ju, Zeqian, et al.
Published: (2024)
SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis
by: Zhang, Zhisheng, et al.
Published: (2025)
STTATTS: Unified Speech-To-Text And Text-To-Speech Model
by: Toyin, Hawau Olamide, et al.
Published: (2024)
Diffusion Guided Language Modeling
by: Lovelace, Justin, et al.
Published: (2024)
MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
by: Lou, Yuxuan, et al.
Published: (2026)
Collaborative Watermarking for Adversarial Speech Synthesis
by: Juvela, Lauri, et al.
Published: (2023)
Scaling Speech Tokenizers with Diffusion Autoencoders
by: Wang, Yuancheng, et al.
Published: (2026)
NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations
by: Liao, Huan, et al.
Published: (2025)
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
by: Wang, Yuancheng, et al.
Published: (2024)
E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
by: Zhang, Zhisheng, et al.
Published: (2025)
Split and Conquer Partial Deepfake Speech
by: Rimon, Inbal, et al.
Published: (2026)
Mitigating Unauthorized Speech Synthesis for Voice Protection
by: Zhang, Zhisheng, et al.
Published: (2024)
High-Resolution Speech Restoration with Latent Diffusion Model
by: Dhyani, Tushar, et al.
Published: (2024)
Single and Few-step Diffusion for Generative Speech Enhancement
by: Lay, Bunlong, et al.
Published: (2023)
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
by: Xin, Detai, et al.
Published: (2024)
Flowing Straighter with Conditional Flow Matching for Accurate Speech Enhancement
by: Cross, Mattias, et al.
Published: (2025)
Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization
by: Della Libera, Luca, et al.
Published: (2026)
Pre-training Feature Guided Diffusion Model for Speech Enhancement
by: Yang, Yiyuan, et al.
Published: (2024)
Diffused Responsibility: Analyzing the Energy Consumption of Generative Text-to-Audio Diffusion Models
by: Passoni, Riccardo, et al.
Published: (2025)
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
by: Peng, Puyuan, et al.
Published: (2024)
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
by: Kim, Ji-Hoon, et al.
Published: (2023)
DFKI-Speech System for WildSpoof Challenge: A robust framework for SASV In-the-Wild
by: Das, Arnab, et al.
Published: (2026)
Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features
by: Saurav, Kumar
Published: (2026)
kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech
by: Hajal, Karl El, et al.
Published: (2024)
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
by: Du, Zhihao, et al.
Published: (2024)
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
by: Tal, Or, et al.
Published: (2025)
Diffuse or Confuse: A Diffusion Deepfake Speech Dataset
by: Firc, Anton, et al.
Published: (2024)
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
by: Kamahori, Keisuke, et al.
Published: (2025)
Low-Resource Guidance for Controllable Latent Audio Diffusion
by: Novack, Zachary, et al.
Published: (2026)
Alternating Approach-Putt Models for Multi-Stage Speech Enhancement
by: Jeong, Iksoon, et al.
Published: (2025)
DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
by: Jia, Dongya, et al.
Published: (2025)