| Main Authors: | Lee, Junhyeok; Kim, Hyeongju |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.07704 |
Similar Items
Length-Aware Rotary Position Embedding for Text-Speech Alignment
by: Kim, Hyeongju, et al.
Published: (2025)
Training Flow Matching Models with Reliable Labels via Self-Purification
by: Kim, Hyeongju, et al.
Published: (2025)
Improving Test-Time Performance of RVQ-based Neural Codecs
by: Kim, Hyeongju, et al.
Published: (2025)
JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis
by: Cho, Hyunjae, et al.
Published: (2024)
DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance
by: Yang, Jinhyeok, et al.
Published: (2024)
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
by: Neekhara, Paarth, et al.
Published: (2024)
MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
by: Lee, Junhyeok, et al.
Published: (2025)
Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec
by: Lee, Junhyeok, et al.
Published: (2026)
Wave-U-Mamba: An End-To-End Framework For High-Quality And Efficient Speech Super Resolution
by: Lee, Yongjoon, et al.
Published: (2024)
Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting
by: Jung, Youngmoon, et al.
Published: (2025)
MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance
by: Kim, Semin, et al.
Published: (2024)
DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations without Text Alignment
by: Oh, Hyung-Seok, et al.
Published: (2024)
Searching for Effective Preprocessing Method and CNN-based Architecture with Efficient Channel Attention on Speech Emotion Recognition
by: Kim, Byunggun, et al.
Published: (2024)
Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders
by: Kim, Seungbae, et al.
Published: (2025)
CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech
by: Wang, Helin, et al.
Published: (2025)
TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
by: Peng, Jing, et al.
Published: (2026)
UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
by: Choi, Woongjib, et al.
Published: (2025)
FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching
by: Yun, Jun-Hak, et al.
Published: (2025)
Spatiotemporal Emotional Synchrony in Dyadic Interactions: The Role of Speech Conditions in Facial and Vocal Affective Alignment
by: Herbuela, Von Ralph Dane Marquez, et al.
Published: (2025)
CONMOD: Controllable Neural Frame-based Modulation Effects
by: Lee, Gyubin, et al.
Published: (2024)
Inference-time Scaling for Diffusion-based Audio Super-resolution
by: Jin, Yizhu, et al.
Published: (2025)
Neural Speech Embeddings for Speech Synthesis Based on Deep Generative Networks
by: Lee, Seo-Hyun, et al.
Published: (2023)
Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations
by: Kim, Jaeyeon, et al.
Published: (2024)
Adaptive Duration Model for Text Speech Alignment
by: Cao, Junjie
Published: (2025)
PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
by: Hong, Changi, et al.
Published: (2026)
Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
by: Kim, Nam-Gyu, et al.
Published: (2025)
Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings
by: Rhyu, Seungyeon, et al.
Published: (2024)
Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9
by: Lee, Do Hyun, et al.
Published: (2024)
Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion
by: Jang, Jaehyuk, et al.
Published: (2026)
LeVo: High-Quality Song Generation with Multi-Preference Alignment
by: Lei, Shun, et al.
Published: (2025)
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)
FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning
by: Kang, Ju Yeon, et al.
Published: (2025)
DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
by: Du, Zongcai, et al.
Published: (2025)
EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech
by: Cho, Deok-Hyeon, et al.
Published: (2024)
DroneAudioset: An Audio Dataset for Drone-based Search and Rescue
by: Gupta, Chitralekha, et al.
Published: (2025)
Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation
by: Zhang, Kang, et al.
Published: (2025)
VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation
by: Kim, Yubin, et al.
Published: (2025)
A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss
by: Tamiti, Tarikul Islam, et al.
Published: (2025)
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
by: Kim, Jaeyeon, et al.
Published: (2024)
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
by: Kim, Jaeyeon, et al.
Published: (2024)