:: Library Catalog

Bibliographic Details
Main Authors:	Koh, Junyoung, Kim, Soo Yong, Choi, Gyu Hyeong, Choi, Yongwon
Format:	Preprint
Published:	2025
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2509.20891

Similar Items

Jamendo-QA: A Large-Scale Music Question Answering Dataset
by: Koh, Junyoung, et al.
Published: (2025)

Instrumental Text-to-Music Generation with Auxiliary Conditioning Branches
by: Koh, Junyoung
Published: (2026)

Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
by: Koh, Junyoung, et al.
Published: (2026)

AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval
by: Kim, Hyun Jun, et al.
Published: (2025)

Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
by: Choi, Jeongsoo, et al.
Published: (2025)

Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
by: Choi, Jeongsoo, et al.
Published: (2023)

DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance
by: Yang, Jinhyeok, et al.
Published: (2024)

A Comparative Analysis of Poetry Reading Audio: Singing, Narrating, or Somewhere In Between?
by: Choi, Kahyun, et al.
Published: (2024)

Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
by: Kim, Nam-Gyu
Published: (2025)

Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting
by: Kim, Hounsu, et al.
Published: (2024)

Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs
by: Yin, Han, et al.
Published: (2025)

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
by: Hai, Jiarui, et al.
Published: (2024)

TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument
by: Kim, Kyungsu, et al.
Published: (2025)

Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
by: Nam, KiHyun, et al.
Published: (2025)

Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition
by: Koo, Inyong, et al.
Published: (2026)

RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer
by: Hong, Seongho, et al.
Published: (2025)

Patient-Aware Feature Alignment for Robust Lung Sound Classification: Cohesion-Separation and Global Alignment Losses
by: Jeong, Seung Gyu, et al.
Published: (2025)

IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling
by: Huang, Kuan-Po, et al.
Published: (2025)

Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models
by: Yin, Han, et al.
Published: (2026)

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR
by: Magoshi, Ryo, et al.
Published: (2026)

Listen through the Sound: Generative Speech Restoration Leveraging Acoustic Context Representation
by: Chung, Soo-Whan, et al.
Published: (2025)

Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction
by: Kim, Minchan, et al.
Published: (2024)

Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS
by: Ko, Myeongjin, et al.
Published: (2023)

Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions
by: Kim, Euiyeon, et al.
Published: (2025)

DESAMO: A Device for Elder-Friendly Smart Homes Powered by Embedded LLM with Audio Modality
by: Choi, Youngwon, et al.
Published: (2025)

LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence
by: Lee, Hyeongkeun, et al.
Published: (2026)

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
by: Diwan, Anuj, et al.
Published: (2026)

Soft Disentanglement in Frequency Bands for Neural Audio Codecs
by: Ginies, Benoit, et al.
Published: (2025)

DeFT-Mamba: Universal Multichannel Sound Separation and Polyphonic Audio Classification
by: Lee, Dongheon, et al.
Published: (2024)

AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion
by: Zhao, Junqi, et al.
Published: (2025)

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
by: Yuan, Yi, et al.
Published: (2025)

AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework
by: Jia, Yuhang, et al.
Published: (2024)

Precise and Simple Audio-to-Score Alignment
by: Peter, Silvan, et al.
Published: (2026)

DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval
by: Xin, Yifei, et al.
Published: (2024)

PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
by: Bang, Hayeon, et al.
Published: (2024)

Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation
by: Gu, Yi, et al.
Published: (2026)

Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
by: Liu, Meizhu, et al.
Published: (2026)

MoEScore: Mixture-of-Experts-Based Text-Audio Relevance Score Prediction for Text-to-Audio System Evaluation
by: Sun, Bochao, et al.
Published: (2026)

Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment
by: Nihal, Ragib Amin, et al.
Published: (2025)

AV-SSAN: Audio-Visual Selective DoA Estimation through Explicit Multi-Band Semantic-Spatial Alignment
by: Chen, Yu, et al.
Published: (2025)