| Main Authors: | Lee, Taehan; Lee, Hyukjun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.01690 |
Similar Items
SAM: A Mamba-2 State-Space Audio-Language Model
by: Lee, Taehan, et al.
Published: (2025)
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
by: Kim, Jaeyeon, et al.
Published: (2024)
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
by: Nguyen, Tan Dat, et al.
Published: (2024)
DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization
by: Lee, Geonyoung, et al.
Published: (2025)
VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation
by: Kim, Yubin, et al.
Published: (2025)
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)
Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance
by: Milling, Manuel, et al.
Published: (2024)
Representation-Regularized Convolutional Audio Transformer for Audio Understanding
by: Han, Bing, et al.
Published: (2026)
Towards Unified Neural Decoding of Perceived, Spoken and Imagined Speech from EEG Signals
by: Lee, Jung-Sun, et al.
Published: (2024)
Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9
by: Lee, Do Hyun, et al.
Published: (2024)
Decoding Ambiguous Emotions with Test-Time Scaling in Audio-Language Models
by: Jia, Hong, et al.
Published: (2026)
Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion
by: Jang, Jaehyuk, et al.
Published: (2026)
Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations
by: Kim, Jaeyeon, et al.
Published: (2024)
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
by: Chen, Shunian, et al.
Published: (2025)
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
by: Lin, Yueqian, et al.
Published: (2024)
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
by: Li, Xiquan, et al.
Published: (2024)
End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding
by: Zeng, Wei, et al.
Published: (2024)
LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation
by: Zhang, Zhisheng, et al.
Published: (2026)
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)
Discrete Audio Tokens: More Than a Survey!
by: Mousavi, Pooneh, et al.
Published: (2025)
CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting
by: Jin, Sichen, et al.
Published: (2024)
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
by: Feng, Jiu, et al.
Published: (2024)
AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks
by: Liang, Yun, et al.
Published: (2024)
Evaluating Hallucinations in Audio-Visual Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions
by: Park, Hansol, et al.
Published: (2025)
SepPrune: Structured Pruning for Efficient Deep Speech Separation
by: Li, Yuqi, et al.
Published: (2025)
PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
by: Bang, Hayeon, et al.
Published: (2024)
AudioBERT: Audio Knowledge Augmented Language Model
by: Ok, Hyunjong, et al.
Published: (2024)
Region-Based Optimization in Continual Learning for Audio Deepfake Detection
by: Chen, Yujie, et al.
Published: (2024)
AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models
by: Yang, Chih-Kai, et al.
Published: (2025)
Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation
by: Gállego, Gerard I., et al.
Published: (2024)
Audio Mamba: Pretrained Audio State Space Model For Audio Tagging
by: Lin, Jiaju, et al.
Published: (2024)
HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering
by: Hu, Xuyi, et al.
Published: (2025)
CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning
by: Luong, Justin, et al.
Published: (2025)
PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation
by: Xiao, Yujia, et al.
Published: (2025)
Affect Decoding in Phonated and Silent Speech Production from Surface EMG
by: Pistrosch, Simon, et al.
Published: (2026)
A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection
by: Lee, Kyungbok, et al.
Published: (2024)
Audio Atlas: Visualizing and Exploring Audio Datasets
by: Lanzendörfer, Luca A., et al.
Published: (2024)
The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition
by: Gao, Ming, et al.
Published: (2025)
LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
by: Rho, Kyeongha, et al.
Published: (2025)
Detecting Music Performance Errors with Transformers
by: Chou, Benjamin Shiue-Hal, et al.
Published: (2025)