Saved in:
| Main Authors: | Kumar, Sonal, Seetharaman, Prem, Chen, Ke, Nieto, Oriol, Su, Jiaqi, Wang, Zhepei, Kumar, Rithesh, Manocha, Dinesh, Bryan, Nicholas J., Jin, Zeyu, Salamon, Justin |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.15766 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation
by: Kumar, Sonal, et al.
Published: (2024)
by: Kumar, Sonal, et al.
Published: (2024)
AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing
by: Chen, William, et al.
Published: (2026)
by: Chen, William, et al.
Published: (2026)
Generative Audio Extension and Morphing
by: Seetharaman, Prem, et al.
Published: (2026)
by: Seetharaman, Prem, et al.
Published: (2026)
PromptSep: Generative Audio Separation via Multimodal Prompting
by: Wen, Yutong, et al.
Published: (2025)
by: Wen, Yutong, et al.
Published: (2025)
Audiocards: Structured Metadata Improves Audio Language Models For Sound Design
by: Sridhar, Sripathi, et al.
Published: (2026)
by: Sridhar, Sripathi, et al.
Published: (2026)
Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations
by: García, Hugo Flores, et al.
Published: (2024)
by: García, Hugo Flores, et al.
Published: (2024)
Taming Audio VAEs via Target-KL Regularization
by: Seetharaman, Prem, et al.
Published: (2026)
by: Seetharaman, Prem, et al.
Published: (2026)
Mix2Morph: Learning Sound Morphing from Noisy Mixes
by: Chu, Annie, et al.
Published: (2026)
by: Chu, Annie, et al.
Published: (2026)
Audio Hallucination Attacks: Probing the Reliability of Large Audio Language Models
by: Seth, Ashish, et al.
Published: (2026)
by: Seth, Ashish, et al.
Published: (2026)
FLAM: Frame-Wise Language-Audio Modeling
by: Wu, Yusong, et al.
Published: (2025)
by: Wu, Yusong, et al.
Published: (2025)
Code Drift: Towards Idempotent Neural Audio Codecs
by: O'Reilly, Patrick, et al.
Published: (2024)
by: O'Reilly, Patrick, et al.
Published: (2024)
Video-Guided Foley Sound Generation with Multimodal Controls
by: Chen, Ziyang, et al.
Published: (2024)
by: Chen, Ziyang, et al.
Published: (2024)
RECAP: Retrieval-Augmented Audio Captioning
by: Ghosh, Sreyan, et al.
Published: (2023)
by: Ghosh, Sreyan, et al.
Published: (2023)
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
by: Ghosh, Sreyan, et al.
Published: (2024)
by: Ghosh, Sreyan, et al.
Published: (2024)
Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning
by: Manco, Ilaria, et al.
Published: (2024)
by: Manco, Ilaria, et al.
Published: (2024)
PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification
by: Seth, Ashish, et al.
Published: (2024)
by: Seth, Ashish, et al.
Published: (2024)
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
by: Sakshi, S, et al.
Published: (2024)
by: Sakshi, S, et al.
Published: (2024)
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
by: Ghosh, Sreyan, et al.
Published: (2024)
by: Ghosh, Sreyan, et al.
Published: (2024)
AV-RIR: Audio-Visual Room Impulse Response Estimation
by: Ratnarajah, Anton, et al.
Published: (2023)
by: Ratnarajah, Anton, et al.
Published: (2023)
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
by: Ghosh, Sreyan, et al.
Published: (2023)
by: Ghosh, Sreyan, et al.
Published: (2023)
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
by: Ghosh, Sreyan, et al.
Published: (2025)
by: Ghosh, Sreyan, et al.
Published: (2025)
Rethinking Music Captioning with Music Metadata LLMs
by: Bukey, Irmak, et al.
Published: (2026)
by: Bukey, Irmak, et al.
Published: (2026)
Do Audio-Language Models Understand Linguistic Variations?
by: Selvakumar, Ramaneswaran, et al.
Published: (2024)
by: Selvakumar, Ramaneswaran, et al.
Published: (2024)
DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers
by: Guimarães, Heitor R., et al.
Published: (2025)
by: Guimarães, Heitor R., et al.
Published: (2025)
A Generative-First Neural Audio Autoencoder
by: Casebeer, Jonah, et al.
Published: (2026)
by: Casebeer, Jonah, et al.
Published: (2026)
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
by: Li, Yingahao Aaron, et al.
Published: (2024)
by: Li, Yingahao Aaron, et al.
Published: (2024)
Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech
by: O'Reilly, Patrick, et al.
Published: (2025)
by: O'Reilly, Patrick, et al.
Published: (2025)
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
by: Goel, Arushi, et al.
Published: (2025)
by: Goel, Arushi, et al.
Published: (2025)
Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning
by: Yang, Chao-Han Huck, et al.
Published: (2025)
by: Yang, Chao-Han Huck, et al.
Published: (2025)
EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning
by: Seth, Ashish, et al.
Published: (2024)
by: Seth, Ashish, et al.
Published: (2024)
The Rhythm In Anything: Audio-Prompted Drums Generation with Masked Language Modeling
by: O'Reilly, Patrick, et al.
Published: (2025)
by: O'Reilly, Patrick, et al.
Published: (2025)
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
by: Xie, Zeyu, et al.
Published: (2024)
by: Xie, Zeyu, et al.
Published: (2024)
AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech
by: Qiu, Jielin, et al.
Published: (2026)
by: Qiu, Jielin, et al.
Published: (2026)
Do Audio-Visual Large Language Models Really See and Hear?
by: Selvakumar, Ramaneswaran, et al.
Published: (2026)
by: Selvakumar, Ramaneswaran, et al.
Published: (2026)
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
by: Xie, Zeyu, et al.
Published: (2023)
by: Xie, Zeyu, et al.
Published: (2023)
Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
by: Ghosh, Sreyan, et al.
Published: (2024)
by: Ghosh, Sreyan, et al.
Published: (2024)
TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification
by: Anand, Nishit, et al.
Published: (2024)
by: Anand, Nishit, et al.
Published: (2024)
On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning
by: Tavares, Tiago, et al.
Published: (2024)
by: Tavares, Tiago, et al.
Published: (2024)
Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation
by: Chen, Ke, et al.
Published: (2024)
by: Chen, Ke, et al.
Published: (2024)
Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes
by: Ratnarajah, Anton, et al.
Published: (2023)
by: Ratnarajah, Anton, et al.
Published: (2023)
Similar Items
-
SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation
by: Kumar, Sonal, et al.
Published: (2024) -
AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing
by: Chen, William, et al.
Published: (2026) -
Generative Audio Extension and Morphing
by: Seetharaman, Prem, et al.
Published: (2026) -
PromptSep: Generative Audio Separation via Multimodal Prompting
by: Wen, Yutong, et al.
Published: (2025) -
Audiocards: Structured Metadata Improves Audio Language Models For Sound Design
by: Sridhar, Sripathi, et al.
Published: (2026)