Library Catalog

Bibliographic Details
Main Authors:	Lee, Junhyeok; Kim, Hyeongju
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2409.07704

Similar Items

Length-Aware Rotary Position Embedding for Text-Speech Alignment
by: Kim, Hyeongju, et al.
Published: (2025)

Training Flow Matching Models with Reliable Labels via Self-Purification
by: Kim, Hyeongju, et al.
Published: (2025)

Improving Test-Time Performance of RVQ-based Neural Codecs
by: Kim, Hyeongju, et al.
Published: (2025)

JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis
by: Cho, Hyunjae, et al.
Published: (2024)

DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance
by: Yang, Jinhyeok, et al.
Published: (2024)

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
by: Neekhara, Paarth, et al.
Published: (2024)

MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
by: Lee, Junhyeok, et al.
Published: (2025)

Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec
by: Lee, Junhyeok, et al.
Published: (2026)

Wave-U-Mamba: An End-To-End Framework For High-Quality And Efficient Speech Super Resolution
by: Lee, Yongjoon, et al.
Published: (2024)

Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting
by: Jung, Youngmoon, et al.
Published: (2025)

MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance
by: Kim, Semin, et al.
Published: (2024)

DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations without Text Alignment
by: Oh, Hyung-Seok, et al.
Published: (2024)

Searching for Effective Preprocessing Method and CNN-based Architecture with Efficient Channel Attention on Speech Emotion Recognition
by: Kim, Byunggun, et al.
Published: (2024)

Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders
by: Kim, Seungbae, et al.
Published: (2025)

CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech
by: Wang, Helin, et al.
Published: (2025)

TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
by: Peng, Jing, et al.
Published: (2026)

UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
by: Choi, Woongjib, et al.
Published: (2025)

FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching
by: Yun, Jun-Hak, et al.
Published: (2025)

Spatiotemporal Emotional Synchrony in Dyadic Interactions: The Role of Speech Conditions in Facial and Vocal Affective Alignment
by: Herbuela, Von Ralph Dane Marquez, et al.
Published: (2025)

CONMOD: Controllable Neural Frame-based Modulation Effects
by: Lee, Gyubin, et al.
Published: (2024)

Inference-time Scaling for Diffusion-based Audio Super-resolution
by: Jin, Yizhu, et al.
Published: (2025)

Neural Speech Embeddings for Speech Synthesis Based on Deep Generative Networks
by: Lee, Seo-Hyun, et al.
Published: (2023)

Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations
by: Kim, Jaeyeon, et al.
Published: (2024)

Adaptive Duration Model for Text Speech Alignment
by: Cao, Junjie
Published: (2025)

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
by: Hong, Changi, et al.
Published: (2026)

Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
by: Kim, Nam-Gyu, et al.
Published: (2025)

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings
by: Rhyu, Seungyeon, et al.
Published: (2024)

Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9
by: Lee, Do Hyun, et al.
Published: (2024)

Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion
by: Jang, Jaehyuk, et al.
Published: (2026)

LeVo: High-Quality Song Generation with Multi-Preference Alignment
by: Lei, Shun, et al.
Published: (2025)

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)

FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning
by: Kang, Ju Yeon, et al.
Published: (2025)

DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
by: Du, Zongcai, et al.
Published: (2025)

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech
by: Cho, Deok-Hyeon, et al.
Published: (2024)

DroneAudioset: An Audio Dataset for Drone-based Search and Rescue
by: Gupta, Chitralekha, et al.
Published: (2025)

Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation
by: Zhang, Kang, et al.
Published: (2025)

VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation
by: Kim, Yubin, et al.
Published: (2025)

A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss
by: Tamiti, Tarikul Islam, et al.
Published: (2025)

Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)

EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
by: Kim, Jaeyeon, et al.
Published: (2024)