:: Library Catalog

Bibliographic Details
Main Authors:	Wijngaard, Gijs; Formisano, Elia; Giordano, Bruno L.; Dumontier, Michel
Format:	Preprint
Published:	2024
Subjects:	Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2403.18572

Similar Items

AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound
by: Wijngaard, Gijs, et al.
Published: (2025)

Audio-Language Datasets of Scenes and Events: A Survey
by: Wijngaard, Gijs, et al.
Published: (2024)

Data-Balanced Curriculum Learning for Audio Question Answering
by: Wijngaard, Gijs, et al.
Published: (2025)

AudioToolAgent: An Agentic Framework for Audio-Language Models
by: Wijngaard, Gijs, et al.
Published: (2025)

Discrete Audio Representations for Automated Audio Captioning
by: Tian, Jingguang, et al.
Published: (2025)

Enhance Temporal Relations in Audio Captioning with Sound Event Detection
by: Xie, Zeyu, et al.
Published: (2023)

MACE: Leveraging Audio for Evaluating Audio Captioning Systems
by: Dixit, Satvik, et al.
Published: (2024)

CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer
by: Takeuchi, Daiki, et al.
Published: (2025)

Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
by: Yuan, Yi, et al.
Published: (2024)

SoundCollage: Automated Discovery of New Classes in Audio Datasets
by: Choi, Ryuhaerang, et al.
Published: (2024)

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
by: Liu, Jizhong, et al.
Published: (2024)

Resource-Efficient Reference-Free Evaluation of Audio Captions
by: Mahfuz, Rehana, et al.
Published: (2024)

Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)

MiDashengLM: Efficient Audio Understanding with General Audio Captions
by: Dinkel, Heinrich, et al.
Published: (2025)

CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions
by: Zhu, Xinfa, et al.
Published: (2025)

Efficient Audio Captioning with Encoder-Level Knowledge Distillation
by: Xu, Xuenan, et al.
Published: (2024)

Construction and Analysis of Impression Caption Dataset for Environmental Sounds
by: Okamoto, Yuki, et al.
Published: (2024)

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)

AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio embedding Sequences
by: Kishi, Minoru, et al.
Published: (2025)

Evaluating CNN with Stacked Feature Representations and Audio Spectrogram Transformer Models for Sound Classification
by: Dehaghania, Parinaz Binandeh, et al.
Published: (2026)

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
by: Wu, Shih-Lun, et al.
Published: (2023)

Zero-Shot Audio Captioning Using Soft and Hard Prompts
by: Zhang, Yiming, et al.
Published: (2024)

SemanticAudio: Audio Generation and Editing in Semantic Space
by: Dai, Zheqi, et al.
Published: (2026)

Retrieval-Augmented Approach for Unsupervised Anomalous Sound Detection and Captioning without Model Training
by: Ogura, Ryoya, et al.
Published: (2024)

SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model
by: Hernandez-Olivan, Carlos, et al.
Published: (2024)

From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs
by: Jia, Yuhang, et al.
Published: (2025)

A Generalist Audio Foundation Model for Comprehensive Body Sound Auscultation
by: Wang, Pingjie, et al.
Published: (2024)

Exploring Self-Supervised Audio Models for Generalized Anomalous Sound Detection
by: Han, Bing, et al.
Published: (2025)

SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer
by: Wang, Helin, et al.
Published: (2024)

AudioSpa: Spatializing Sound Events with Text
by: Feng, Linfeng, et al.
Published: (2025)

Region-Specific Audio Tagging for Spatial Sound
by: Zhao, Jinzheng, et al.
Published: (2025)

Baseline Systems and Evaluation Metrics for Spatial Semantic Segmentation of Sound Scenes
by: Nguyen, Binh Thien, et al.
Published: (2025)

EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
by: Kim, Jaeyeon, et al.
Published: (2024)

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs
by: Chen, Wenxi, et al.
Published: (2024)

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions
by: Xin, Yifei, et al.
Published: (2023)

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
by: Wu, Yusong, et al.
Published: (2022)

Soundscape Captioning using Sound Affective Quality Network and Large Language Model
by: Hou, Yuanbo, et al.
Published: (2024)

Aligning Audio Captions with Human Preferences
by: Hegde, Kartik, et al.
Published: (2025)

Exploring the Potential of Data-Driven Spatial Audio Enhancement Using a Single-Channel Model
by: Santos, Arthur N. dos, et al.
Published: (2024)

Effective Pre-Training of Audio Transformers for Sound Event Detection
by: Schmid, Florian, et al.
Published: (2024)