| Main Authors: | Koh, Junyoung, Kim, Soo Yong, Choi, Gyu Hyeong, Choi, Yongwon |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.20891 |
Similar Items
Jamendo-QA: A Large-Scale Music Question Answering Dataset
by: Koh, Junyoung, et al.
Published: (2025)
Instrumental Text-to-Music Generation with Auxiliary Conditioning Branches
by: Koh, Junyoung
Published: (2026)
Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
by: Koh, Junyoung, et al.
Published: (2026)
AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval
by: Kim, Hyun Jun, et al.
Published: (2025)
Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
by: Choi, Jeongsoo, et al.
Published: (2025)
Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
by: Choi, Jeongsoo, et al.
Published: (2023)
DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance
by: Yang, Jinhyeok, et al.
Published: (2024)
A Comparative Analysis of Poetry Reading Audio: Singing, Narrating, or Somewhere In Between?
by: Choi, Kahyun, et al.
Published: (2024)
Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
by: Kim, Nam-Gyu
Published: (2025)
Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting
by: Kim, Hounsu, et al.
Published: (2024)
Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs
by: Yin, Han, et al.
Published: (2025)
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
by: Hai, Jiarui, et al.
Published: (2024)
TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument
by: Kim, Kyungsu, et al.
Published: (2025)
Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
by: Nam, KiHyun, et al.
Published: (2025)
Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition
by: Koo, Inyong, et al.
Published: (2026)
RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer
by: Hong, Seongho, et al.
Published: (2025)
Patient-Aware Feature Alignment for Robust Lung Sound Classification: Cohesion-Separation and Global Alignment Losses
by: Jeong, Seung Gyu, et al.
Published: (2025)
IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling
by: Huang, Kuan-Po, et al.
Published: (2025)
Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models
by: Yin, Han, et al.
Published: (2026)
Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR
by: Magoshi, Ryo, et al.
Published: (2026)
Listen through the Sound: Generative Speech Restoration Leveraging Acoustic Context Representation
by: Chung, Soo-Whan, et al.
Published: (2025)
Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction
by: Kim, Minchan, et al.
Published: (2024)
Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS
by: Ko, Myeongjin, et al.
Published: (2023)
Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions
by: Kim, Euiyeon, et al.
Published: (2025)
DESAMO: A Device for Elder-Friendly Smart Homes Powered by Embedded LLM with Audio Modality
by: Choi, Youngwon, et al.
Published: (2025)
LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence
by: Lee, Hyeongkeun, et al.
Published: (2026)
ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
by: Diwan, Anuj, et al.
Published: (2026)
Soft Disentanglement in Frequency Bands for Neural Audio Codecs
by: Ginies, Benoit, et al.
Published: (2025)
DeFT-Mamba: Universal Multichannel Sound Separation and Polyphonic Audio Classification
by: Lee, Dongheon, et al.
Published: (2024)
AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion
by: Zhao, Junqi, et al.
Published: (2025)
DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
by: Yuan, Yi, et al.
Published: (2025)
AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework
by: Jia, Yuhang, et al.
Published: (2024)
Precise and Simple Audio-to-Score Alignment
by: Peter, Silvan, et al.
Published: (2026)
DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval
by: Xin, Yifei, et al.
Published: (2024)
PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
by: Bang, Hayeon, et al.
Published: (2024)
Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation
by: Gu, Yi, et al.
Published: (2026)
Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
by: Liu, Meizhu, et al.
Published: (2026)
MoEScore: Mixture-of-Experts-Based Text-Audio Relevance Score Prediction for Text-to-Audio System Evaluation
by: Sun, Bochao, et al.
Published: (2026)
Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment
by: Nihal, Ragib Amin, et al.
Published: (2025)
AV-SSAN: Audio-Visual Selective DoA Estimation through Explicit Multi-Band Semantic-Spatial Alignment
by: Chen, Yu, et al.
Published: (2025)