| Main Authors: | Yang, Mu; Shi, Bowen; Le, Matthew; Hsu, Wei-Ning; Tjandra, Andros |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.05141 |
Similar Items
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
by: Tjandra, Andros, et al.
Published: (2025)
Generative Pre-training for Speech with Flow Matching
by: Liu, Alexander H., et al.
Published: (2023)
MusicFlow: Cascaded Flow Matching for Text Guided Music Generation
by: Prajwal, K R, et al.
Published: (2024)
The AudioMOS Challenge 2025
by: Huang, Wen-Chin, et al.
Published: (2025)
PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification
by: Seth, Ashish, et al.
Published: (2024)
TTA-Bench: A Comprehensive Benchmark for Evaluating Text-to-Audio Models
by: Wang, Hui, et al.
Published: (2025)
Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
by: Dixit, Satvik, et al.
Published: (2024)
RiTTA: Modeling Event Relations in Text-to-Audio Generation
by: He, Yuhang, et al.
Published: (2024)
MiMo-Audio: Audio Language Models are Few-Shot Learners
by: Core Team, et al.
Published: (2025)
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
by: Zhang, Leying, et al.
Published: (2026)
Zero-Shot Audio Captioning Using Soft and Hard Prompts
by: Zhang, Yiming, et al.
Published: (2024)
Zero-Shot Text-to-Speech from Continuous Text Streams
by: Dang, Trung, et al.
Published: (2024)
On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning
by: Tavares, Tiago, et al.
Published: (2024)
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
by: Ghosh, Sreyan, et al.
Published: (2024)
Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
by: Manor, Hila, et al.
Published: (2024)
Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
by: Nespoli, Francesco, et al.
Published: (2024)
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
by: Kim, Jaehyeon, et al.
Published: (2024)
Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
by: Chien, Chung-Ming, et al.
Published: (2024)
Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement
by: Dutta, Soumya, et al.
Published: (2024)
Can Quantized Audio Language Models Perform Zero-Shot Spoofing Detection?
by: Dutta, Bikash, et al.
Published: (2025)
Zero-Shot Fake Video Detection by Audio-Visual Consistency
by: Li, Xiaolou, et al.
Published: (2024)
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
by: Kong, Zhifeng, et al.
Published: (2024)
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
by: Lei, Shun, et al.
Published: (2023)
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
by: Wang, Chunhui, et al.
Published: (2024)
Improving Rare-Word Recognition of Whisper in Zero-Shot Settings
by: Jogi, Yash, et al.
Published: (2025)
Retrieval-Augmented Text-to-Audio Generation
by: Yuan, Yi, et al.
Published: (2023)
Zero-Shot Text-to-Speech for Vietnamese
by: Vu, Thi, et al.
Published: (2025)
PALM: Few-Shot Prompt Learning for Audio Language Models
by: Hanif, Asif, et al.
Published: (2024)
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
by: Zhang, Bowen, et al.
Published: (2025)
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
by: Zhu, Han, et al.
Published: (2025)
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
by: Bai, Yatong, et al.
Published: (2023)
WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
by: Chen, Yifu, et al.
Published: (2025)
Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation
by: Han, Changjin, et al.
Published: (2024)
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)
On the Transferability of Large-Scale Self-Supervision to Few-Shot Audio Classification
by: Heggan, Calum, et al.
Published: (2024)
Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
by: Wu, Shih-Lun, et al.
Published: (2023)
AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval
by: Lin, Jingru, et al.
Published: (2026)
Multi-label Zero-Shot Audio Classification with Temporal Attention
by: Dogan, Duygu, et al.
Published: (2024)
Self-supervised Learning for Acoustic Few-Shot Classification
by: Liang, Jingyong, et al.
Published: (2024)