| Main Authors: | Changin Choi, Sungjun Lim, Wonjong Rhee |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.10913 |
Similar Items
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
by: Li, Xiquan, et al.
Published: (2024)
RECAP: Retrieval-Augmented Audio Captioning
by: Ghosh, Sreyan, et al.
Published: (2023)
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)
Retrieval-Augmented Text-to-Audio Generation
by: Yuan, Yi, et al.
Published: (2023)
Retrieval-Augmented Audio Deepfake Detection
by: Kang, Zuheng, et al.
Published: (2024)
Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9
by: Lee, Do Hyun, et al.
Published: (2024)
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
by: Chen, Shunian, et al.
Published: (2025)
WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
by: Chen, Yifu, et al.
Published: (2025)
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)
Towards Generating Diverse Audio Captions via Adversarial Training
by: Mei, Xinhao, et al.
Published: (2022)
Evaluating Hallucinations in Audio-Visual Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions
by: Park, Hansol, et al.
Published: (2025)
MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions
by: Choi, Suhwan, et al.
Published: (2025)
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
by: Kim, Jaeyeon, et al.
Published: (2024)
PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation
by: Xiao, Yujia, et al.
Published: (2025)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine
by: Kuznetsova, Anastasia, et al.
Published: (2025)
BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval
by: Lu, Zhenyu, et al.
Published: (2024)
Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings
by: Rhyu, Seungyeon, et al.
Published: (2024)
LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
by: Rho, Kyeongha, et al.
Published: (2025)
UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
by: Choi, Woongjib, et al.
Published: (2025)
Augmentation through Laundering Attacks for Audio Spoof Detection
by: Ali, Hashim, et al.
Published: (2024)
PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
by: Bang, Hayeon, et al.
Published: (2024)
ModalityMirror: Improving Audio Classification in Modality Heterogeneity Federated Learning with Multimodal Distillation
by: Feng, Tiantian, et al.
Published: (2024)
Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
by: Luong, Manh, et al.
Published: (2024)
Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis
by: Juvela, Lauri, et al.
Published: (2024)
Audio-Guided Fusion Techniques for Multimodal Emotion Analysis
by: Shi, Pujin, et al.
Published: (2024)
AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion
by: Zhao, Junqi, et al.
Published: (2025)
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
by: Yuan, Yi, et al.
Published: (2025)
GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems
by: Robatian, Amin, et al.
Published: (2025)
MixAssist: An Audio-Language Dataset for Co-Creative AI Assistance in Music Mixing
by: Clemens, Michael, et al.
Published: (2025)
Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation
by: Xiong, Chenxu, et al.
Published: (2024)
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
by: Erol, Mehmet Hamza, et al.
Published: (2024)
Learning Frame-Wise Emotion Intensity for Audio-Driven Talking-Head Generation
by: Xu, Jingyi, et al.
Published: (2024)
ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors
by: Yin, Yuguo, et al.
Published: (2025)
PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs
by: Dementyev, Artem, et al.
Published: (2026)
CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation
by: Wu, Junda, et al.
Published: (2024)
Audio Explanation Synthesis with Generative Foundation Models
by: Akman, Alican, et al.
Published: (2024)
ViSAGe: Video-to-Spatial Audio Generation
by: Kim, Jaeyeon, et al.
Published: (2025)
Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts
by: Jin, Hojun, et al.
Published: (2025)
Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis
by: Zhang, Yucong, et al.
Published: (2024)
AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval
by: Kim, Hyun Jun, et al.
Published: (2025)