:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Yang, Mu, Shi, Bowen, Le, Matthew, Hsu, Wei-Ning, Tjandra, Andros
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing Sound
Online Access:	https://arxiv.org/abs/2411.05141
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
by: Tjandra, Andros, et al.
Published: (2025)

Generative Pre-training for Speech with Flow Matching
by: Liu, Alexander H., et al.
Published: (2023)

MusicFlow: Cascaded Flow Matching for Text Guided Music Generation
by: Prajwal, K R, et al.
Published: (2024)

The AudioMOS Challenge 2025
by: Huang, Wen-Chin, et al.
Published: (2025)

PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification
by: Seth, Ashish, et al.
Published: (2024)

TTA-Bench: A Comprehensive Benchmark for Evaluating Text-to-Audio Models
by: Wang, Hui, et al.
Published: (2025)

Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
by: Dixit, Satvik, et al.
Published: (2024)

RiTTA: Modeling Event Relations in Text-to-Audio Generation
by: He, Yuhang, et al.
Published: (2024)

MiMo-Audio: Audio Language Models are Few-Shot Learners
by: Core Team, et al.
Published: (2025)

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
by: Zhang, Leying, et al.
Published: (2026)

Zero-Shot Audio Captioning Using Soft and Hard Prompts
by: Zhang, Yiming, et al.
Published: (2024)

Zero-Shot Text-to-Speech from Continuous Text Streams
by: Dang, Trung, et al.
Published: (2024)

On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning
by: Tavares, Tiago, et al.
Published: (2024)

ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
by: Ghosh, Sreyan, et al.
Published: (2024)

Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
by: Manor, Hila, et al.
Published: (2024)

Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
by: Nespoli, Francesco, et al.
Published: (2024)

CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
by: Kim, Jaehyeon, et al.
Published: (2024)

Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
by: Chien, Chung-Ming, et al.
Published: (2024)

Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement
by: Dutta, Soumya, et al.
Published: (2024)

Can Quantized Audio Language Models Perform Zero-Shot Spoofing Detection?
by: Dutta, Bikash, et al.
Published: (2025)

Zero-Shot Fake Video Detection by Audio-Visual Consistency
by: Li, Xiaolou, et al.
Published: (2024)

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
by: Kong, Zhifeng, et al.
Published: (2024)

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
by: Lei, Shun, et al.
Published: (2023)

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
by: Wang, Chunhui, et al.
Published: (2024)

Improving Rare-Word Recognition of Whisper in Zero-Shot Settings
by: Jogi, Yash, et al.
Published: (2025)

Retrieval-Augmented Text-to-Audio Generation
by: Yuan, Yi, et al.
Published: (2023)

Zero-Shot Text-to-Speech for Vietnamese
by: Vu, Thi, et al.
Published: (2025)

PALM: Few-Shot Prompt Learning for Audio Language Models
by: Hanif, Asif, et al.
Published: (2024)

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
by: Zhang, Bowen, et al.
Published: (2025)

ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
by: Zhu, Han, et al.
Published: (2025)

ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
by: Bai, Yatong, et al.
Published: (2023)

WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
by: Chen, Yifu, et al.
Published: (2025)

Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation
by: Han, Changjin, et al.
Published: (2024)

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)

On the Transferability of Large-Scale Self-Supervision to Few-Shot Audio Classification
by: Heggan, Calum, et al.
Published: (2024)

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
by: Wu, Shih-Lun, et al.
Published: (2023)

AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval
by: Lin, Jingru, et al.
Published: (2026)

Multi-label Zero-Shot Audio Classification with Temporal Attention
by: Dogan, Duygu, et al.
Published: (2024)

Self-supervised Learning for Acoustic Few-Shot Classification
by: Liang, Jingyong, et al.
Published: (2024)