:: Library Catalog

Bibliographic Details
Main Authors:	Choi, Changin; Lim, Sungjun; Rhee, Wonjong
Format:	Preprint
Published:	2024
Subjects:	Sound; Artificial Intelligence; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2410.10913

Similar Items

DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
by: Li, Xiquan, et al.
Published: (2024)

RECAP: Retrieval-Augmented Audio Captioning
by: Ghosh, Sreyan, et al.
Published: (2023)

Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)

Retrieval-Augmented Text-to-Audio Generation
by: Yuan, Yi, et al.
Published: (2023)

Retrieval-Augmented Audio Deepfake Detection
by: Kang, Zuheng, et al.
Published: (2024)

Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9
by: Lee, Do Hyun, et al.
Published: (2024)

FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
by: Chen, Shunian, et al.
Published: (2025)

WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
by: Chen, Yifu, et al.
Published: (2025)

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)

Towards Generating Diverse Audio Captions via Adversarial Training
by: Mei, Xinhao, et al.
Published: (2022)

Evaluating Hallucinations in Audio-Visual Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions
by: Park, Hansol, et al.
Published: (2025)

MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions
by: Choi, Suhwan, et al.
Published: (2025)

EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
by: Kim, Jaeyeon, et al.
Published: (2024)

PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation
by: Xiao, Yujia, et al.
Published: (2025)

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine
by: Kuznetsova, Anastasia, et al.
Published: (2025)

BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval
by: Lu, Zhenyu, et al.
Published: (2024)

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings
by: Rhyu, Seungyeon, et al.
Published: (2024)

LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
by: Rho, Kyeongha, et al.
Published: (2025)

UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
by: Choi, Woongjib, et al.
Published: (2025)

Augmentation through Laundering Attacks for Audio Spoof Detection
by: Ali, Hashim, et al.
Published: (2024)

PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
by: Bang, Hayeon, et al.
Published: (2024)

ModalityMirror: Improving Audio Classification in Modality Heterogeneity Federated Learning with Multimodal Distillation
by: Feng, Tiantian, et al.
Published: (2024)

Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
by: Luong, Manh, et al.
Published: (2024)

Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis
by: Juvela, Lauri, et al.
Published: (2024)

Audio-Guided Fusion Techniques for Multimodal Emotion Analysis
by: Shi, Pujin, et al.
Published: (2024)

AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion
by: Zhao, Junqi, et al.
Published: (2025)

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
by: Yuan, Yi, et al.
Published: (2025)

GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems
by: Robatian, Amin, et al.
Published: (2025)

MixAssist: An Audio-Language Dataset for Co-Creative AI Assistance in Music Mixing
by: Clemens, Michael, et al.
Published: (2025)

Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation
by: Xiong, Chenxu, et al.
Published: (2024)

Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
by: Erol, Mehmet Hamza, et al.
Published: (2024)

Learning Frame-Wise Emotion Intensity for Audio-Driven Talking-Head Generation
by: Xu, Jingyi, et al.
Published: (2024)

ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors
by: Yin, Yuguo, et al.
Published: (2025)

PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs
by: Dementyev, Artem, et al.
Published: (2026)

CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation
by: Wu, Junda, et al.
Published: (2024)

Audio Explanation Synthesis with Generative Foundation Models
by: Akman, Alican, et al.
Published: (2024)

ViSAGe: Video-to-Spatial Audio Generation
by: Kim, Jaeyeon, et al.
Published: (2025)

Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts
by: Jin, Hojun, et al.
Published: (2025)

Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis
by: Zhang, Yucong, et al.
Published: (2024)

AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval
by: Kim, Hyun Jun, et al.
Published: (2025)