| Main Authors: | Li, Tao, Ge, Wenshuo, Wang, Zhichao, Cui, Zihao, Ma, Yong, Gao, Yingying, Deng, Chao, Zhang, Shilei, Feng, Junlan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2512.13251 |
Similar Items
Plugin Speech Enhancement: A Universal Speech Enhancement Framework Inspired by Dynamic Neural Network
by: Chen, Yanan, et al.
Published: (2024)
OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2026)
DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners
by: Luo, Xiaoxue, et al.
Published: (2025)
HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling
by: Si, Yuke, et al.
Published: (2025)
PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models
by: Yang, Runyan, et al.
Published: (2024)
On Calibration of Speech Classification Models: Insights from Energy-Based Model Investigations
by: Hao, Yaqian, et al.
Published: (2024)
RepCodec: A Speech Representation Codec for Speech Tokenization
by: Huang, Zhichao, et al.
Published: (2023)
MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition
by: Sun, Haiyang, et al.
Published: (2023)
GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model
by: Gao, Yingying, et al.
Published: (2024)
FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec
by: Halychanskyi, Yurii, et al.
Published: (2025)
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
by: Shi, Jiacheng, et al.
Published: (2026)
DisSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration
by: Liang, Ziqi, et al.
Published: (2026)
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
by: Ju, Zeqian, et al.
Published: (2024)
Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation
by: Yang, Runyan, et al.
Published: (2025)
AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ
by: Meng, Zhaoyang, et al.
Published: (2026)
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)
Fewer-token Neural Speech Codec with Time-invariant Codes
by: Ren, Yong, et al.
Published: (2023)
B-GRPO: Unsupervised Speech Emotion Recognition based on Batched-Group Relative Policy Optimization
by: Gao, Yingying, et al.
Published: (2026)
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
by: Wang, Yuancheng, et al.
Published: (2024)
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
by: Kim, Jaehyeon, et al.
Published: (2024)
Personalized Neural Speech Codec
by: Jang, Inseon, et al.
Published: (2024)
SpatialCodec: Neural Spatial Speech Coding
by: Xu, Zhongweiyang, et al.
Published: (2023)
CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models
by: Chen, Junyang, et al.
Published: (2026)
A Neural Speech Codec for Noise Robust Speech Coding
by: Huang, Jiayi, et al.
Published: (2023)
TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
by: Ji, Shengpeng, et al.
Published: (2023)
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)
U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
by: Yang, Xusheng, et al.
Published: (2025)
Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
by: Zhang, Leying, et al.
Published: (2025)
WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification
by: Zhou, Junzuo, et al.
Published: (2024)
SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models
by: Wang, Linqin, et al.
Published: (2024)
DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
by: Liu, Jiaxuan, et al.
Published: (2024)
Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
by: Nespoli, Francesco, et al.
Published: (2024)
MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt
by: Wu, Zhichao, et al.
Published: (2025)
Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling
by: Karapiperis, Sotirios, et al.
Published: (2024)
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
by: Nishimura, Yuto, et al.
Published: (2024)
FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)
Towards Audio Codec-based Speech Separation
by: Yip, Jia Qi, et al.
Published: (2024)
Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
by: Li, Jiaqi, et al.
Published: (2024)
BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis
by: Xing, Jingyuan, et al.
Published: (2025)
Zero-Shot Text-to-Speech for Vietnamese
by: Vu, Thi, et al.
Published: (2025)