:: Library Catalog

Bibliographic Details
Main Authors:	Nguyen, Tan Dat; Kim, Ji-Hoon; Jang, Youngjoon; Kim, Jaehun; Chung, Joon Son
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence; Signal Processing
Online Access:	https://arxiv.org/abs/2401.10032

Similar Items

SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
by: Nguyen, Tan Dat, et al.
Published: (2025)

MamTra: A Hybrid Mamba-Transformer Backbone for Speech Synthesis
by: Nguyen, Tan Dat, et al.
Published: (2026)

AdaptVC: High Quality Voice Conversion with Adaptive Learning
by: Kim, Jaehun, et al.
Published: (2025)

VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis
by: Jung, Jaemin, et al.
Published: (2024)

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
by: Kim, Ji-Hoon, et al.
Published: (2023)

PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a Diffusion Probabilistic Model
by: Hono, Yukiya, et al.
Published: (2024)

CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
by: Kim, Ji-Hoon, et al.
Published: (2024)

LP-CFM: Perceptual Invariance-Aware Conditional Flow Matching for Speech Modeling
by: Kwak, Doyeop, et al.
Published: (2025)

An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder
by: Gu, Yicheng, et al.
Published: (2024)

EDNet: A Versatile Speech Enhancement Framework with Gating Mamba Mechanism and Phase Shift-Invariant Training
by: Kwak, Doyeop, et al.
Published: (2025)

MusicHiFi: Fast High-Fidelity Stereo Vocoding
by: Zhu, Ge, et al.
Published: (2024)

SCORE: Scaling audio generation using Standardized COmposite REwards
by: Jung, Jaemin, et al.
Published: (2025)

Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech Generation
by: Kim, Miseul, et al.
Published: (2024)

Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
by: Choi, Jeongsoo, et al.
Published: (2025)

A Neural Denoising Vocoder for Clean Waveform Generation from Noisy Mel-Spectrogram based on Amplitude and Phase Predictions
by: Du, Hui-Peng, et al.
Published: (2024)

InfiniteAudio: Infinite-Length Audio Generation with Consistency
by: Jung, Chaeyoung, et al.
Published: (2025)

QHARMA-GAN: Quasi-Harmonic Neural Vocoder based on Autoregressive Moving Average Model
by: Chen, Shaowen, et al.
Published: (2025)

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
by: Nguyen, Tan Dat, et al.
Published: (2024)

FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
by: Jung, Chaeyoung, et al.
Published: (2024)

From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
by: Kim, Ji-Hoon, et al.
Published: (2025)

PLDNet: PLD-Guided Lightweight Deep Network Boosted by Efficient Attention for Handheld Dual-Microphone Speech Enhancement
by: Zhou, Nan, et al.
Published: (2024)

Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
by: Choi, Jeongsoo, et al.
Published: (2025)

Real-Time Streaming Mel Vocoding with Generative Flow Matching
by: Welker, Simon, et al.
Published: (2025)

GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model
by: Liu, Haocheng, et al.
Published: (2024)

GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis
by: Baoueb, Teysir, et al.
Published: (2025)

Probing Cross-modal Information Hubs in Audio-Visual LLMs
by: Jung, Jihoo, et al.
Published: (2026)

MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model
by: Pham, The Hieu, et al.
Published: (2025)

VoxSim: A perceptual voice similarity dataset
by: Ahn, Junseok, et al.
Published: (2024)

Lightweight Audio Segmentation for Long-form Speech Translation
by: Lee, Jaesong, et al.
Published: (2024)

UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
by: Choi, Woongjib, et al.
Published: (2025)

SUNAC: Source-aware Unified Neural Audio Codec
by: Aihara, Ryo, et al.
Published: (2025)

The Overview of Segmental Durations Modification Algorithms on Speech Signal Characteristics
by: Jang, Kyeomeun, et al.
Published: (2025)

Relational Proxy Loss for Audio-Text based Keyword Spotting
by: Jung, Youngmoon, et al.
Published: (2024)

UNMIXX: Untangling Highly Correlated Singing Voices Mixtures
by: Jung, Jihoo, et al.
Published: (2026)

Wideband Relative Transfer Function (RTF) Estimation Exploiting Frequency Correlations
by: Bologni, Giovanni, et al.
Published: (2024)

ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction
by: Yang, Shu-wen, et al.
Published: (2025)

FUN-SSL: Full-band Layer Followed by U-Net with Narrow-band Layers for Multiple Moving Sound Source Localization
by: Choi, Yuseon, et al.
Published: (2025)

SpeechMLC: Speech Multi-label Classification
by: Kim, Miseul, et al.
Published: (2025)

Neural Spectral Band Generation for Audio Coding
by: Choi, Woongjib, et al.
Published: (2025)

Chirp Group Delay based Onset Detection in Instruments with Fast Attack
by: Joysingh, S. Johanan, et al.
Published: (2024)