:: Library Catalog

Bibliographic Details
Main Authors:	Lovelace, Justin; Ray, Soham; Kim, Kwangyoun; Weinberger, Kilian Q.; Wu, Felix
Format:	Preprint
Published:	2024
Subjects:	Sound; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2409.03717

Similar Items

Music Transcription with (Almost) No Supervision
by: Shin, Saebyeol, et al.
Published: (2026)

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
by: Liu, Zhijun, et al.
Published: (2024)

DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
by: Lou, Yuxuan, et al.
Published: (2026)

Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2
by: Mohanty, Suvendu Sekhar
Published: (2026)

DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors
by: Lee, Keon, et al.
Published: (2024)

LatentSpeech: Latent Diffusion for Text-To-Speech Generation
by: Lou, Haowei, et al.
Published: (2024)

FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)

Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling
by: Belardi, Christian, et al.
Published: (2026)

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
by: Novack, Zachary, et al.
Published: (2026)

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
by: Ju, Zeqian, et al.
Published: (2024)

SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis
by: Zhang, Zhisheng, et al.
Published: (2025)

STTATTS: Unified Speech-To-Text And Text-To-Speech Model
by: Toyin, Hawau Olamide, et al.
Published: (2024)

Diffusion Guided Language Modeling
by: Lovelace, Justin, et al.
Published: (2024)

MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
by: Lou, Yuxuan, et al.
Published: (2026)

Collaborative Watermarking for Adversarial Speech Synthesis
by: Juvela, Lauri, et al.
Published: (2023)

Scaling Speech Tokenizers with Diffusion Autoencoders
by: Wang, Yuancheng, et al.
Published: (2026)

NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations
by: Liao, Huan, et al.
Published: (2025)

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
by: Wang, Yuancheng, et al.
Published: (2024)

E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
by: Zhang, Zhisheng, et al.
Published: (2025)

Split and Conquer Partial Deepfake Speech
by: Rimon, Inbal, et al.
Published: (2026)

Mitigating Unauthorized Speech Synthesis for Voice Protection
by: Zhang, Zhisheng, et al.
Published: (2024)

High-Resolution Speech Restoration with Latent Diffusion Model
by: Dhyani, Tushar, et al.
Published: (2024)

Single and Few-step Diffusion for Generative Speech Enhancement
by: Lay, Bunlong, et al.
Published: (2023)

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
by: Xin, Detai, et al.
Published: (2024)

Flowing Straighter with Conditional Flow Matching for Accurate Speech Enhancement
by: Cross, Mattias, et al.
Published: (2025)

Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization
by: Della Libera, Luca, et al.
Published: (2026)

Pre-training Feature Guided Diffusion Model for Speech Enhancement
by: Yang, Yiyuan, et al.
Published: (2024)

Diffused Responsibility: Analyzing the Energy Consumption of Generative Text-to-Audio Diffusion Models
by: Passoni, Riccardo, et al.
Published: (2025)

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
by: Peng, Puyuan, et al.
Published: (2024)

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
by: Kim, Ji-Hoon, et al.
Published: (2023)

DFKI-Speech System for WildSpoof Challenge: A robust framework for SASV In-the-Wild
by: Das, Arnab, et al.
Published: (2026)

Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features
by: Saurav, Kumar
Published: (2026)

kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech
by: Hajal, Karl El, et al.
Published: (2024)

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
by: Du, Zhihao, et al.
Published: (2024)

Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
by: Tal, Or, et al.
Published: (2025)

Diffuse or Confuse: A Diffusion Deepfake Speech Dataset
by: Firc, Anton, et al.
Published: (2024)

LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
by: Kamahori, Keisuke, et al.
Published: (2025)

Low-Resource Guidance for Controllable Latent Audio Diffusion
by: Novack, Zachary, et al.
Published: (2026)

Alternating Approach-Putt Models for Multi-Stage Speech Enhancement
by: Jeong, Iksoon, et al.
Published: (2025)

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
by: Jia, Dongya, et al.
Published: (2025)