:: Library Catalog

Bibliographic Details
Main Authors:	Kumar, Sonal; Seetharaman, Prem; Chen, Ke; Nieto, Oriol; Su, Jiaqi; Wang, Zhepei; Kumar, Rithesh; Manocha, Dinesh; Bryan, Nicholas J.; Jin, Zeyu; Salamon, Justin
Format:	Preprint
Published:	2026
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2602.15766

Similar Items

SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation
by: Kumar, Sonal, et al.
Published: (2024)

AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing
by: Chen, William, et al.
Published: (2026)

Generative Audio Extension and Morphing
by: Seetharaman, Prem, et al.
Published: (2026)

PromptSep: Generative Audio Separation via Multimodal Prompting
by: Wen, Yutong, et al.
Published: (2025)

Audiocards: Structured Metadata Improves Audio Language Models For Sound Design
by: Sridhar, Sripathi, et al.
Published: (2026)

Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations
by: García, Hugo Flores, et al.
Published: (2024)

Taming Audio VAEs via Target-KL Regularization
by: Seetharaman, Prem, et al.
Published: (2026)

Mix2Morph: Learning Sound Morphing from Noisy Mixes
by: Chu, Annie, et al.
Published: (2026)

Audio Hallucination Attacks: Probing the Reliability of Large Audio Language Models
by: Seth, Ashish, et al.
Published: (2026)

FLAM: Frame-Wise Language-Audio Modeling
by: Wu, Yusong, et al.
Published: (2025)

Code Drift: Towards Idempotent Neural Audio Codecs
by: O'Reilly, Patrick, et al.
Published: (2024)

Video-Guided Foley Sound Generation with Multimodal Controls
by: Chen, Ziyang, et al.
Published: (2024)

RECAP: Retrieval-Augmented Audio Captioning
by: Ghosh, Sreyan, et al.
Published: (2023)

ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
by: Ghosh, Sreyan, et al.
Published: (2024)

Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning
by: Manco, Ilaria, et al.
Published: (2024)

PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification
by: Seth, Ashish, et al.
Published: (2024)

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
by: Sakshi, S, et al.
Published: (2024)

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
by: Ghosh, Sreyan, et al.
Published: (2024)

AV-RIR: Audio-Visual Room Impulse Response Estimation
by: Ratnarajah, Anton, et al.
Published: (2023)

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
by: Ghosh, Sreyan, et al.
Published: (2023)

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
by: Ghosh, Sreyan, et al.
Published: (2025)

Rethinking Music Captioning with Music Metadata LLMs
by: Bukey, Irmak, et al.
Published: (2026)

Do Audio-Language Models Understand Linguistic Variations?
by: Selvakumar, Ramaneswaran, et al.
Published: (2024)

DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers
by: Guimarães, Heitor R., et al.
Published: (2025)

A Generative-First Neural Audio Autoencoder
by: Casebeer, Jonah, et al.
Published: (2026)

DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
by: Li, Yinghao Aaron, et al.
Published: (2024)

Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech
by: O'Reilly, Patrick, et al.
Published: (2025)

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
by: Goel, Arushi, et al.
Published: (2025)

Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning
by: Yang, Chao-Han Huck, et al.
Published: (2025)

EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning
by: Seth, Ashish, et al.
Published: (2024)

The Rhythm In Anything: Audio-Prompted Drums Generation with Masked Language Modeling
by: O'Reilly, Patrick, et al.
Published: (2025)

PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
by: Xie, Zeyu, et al.
Published: (2024)

AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech
by: Qiu, Jielin, et al.
Published: (2026)

Do Audio-Visual Large Language Models Really See and Hear?
by: Selvakumar, Ramaneswaran, et al.
Published: (2026)

Enhance Temporal Relations in Audio Captioning with Sound Event Detection
by: Xie, Zeyu, et al.
Published: (2023)

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
by: Ghosh, Sreyan, et al.
Published: (2024)

TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification
by: Anand, Nishit, et al.
Published: (2024)

On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning
by: Tavares, Tiago, et al.
Published: (2024)

Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation
by: Chen, Ke, et al.
Published: (2024)

Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes
by: Ratnarajah, Anton, et al.
Published: (2023)