:: Library Catalog

Bibliographic Details
Main Authors:	Li, Tao; Ge, Wenshuo; Wang, Zhichao; Cui, Zihao; Ma, Yong; Gao, Yingying; Deng, Chao; Zhang, Shilei; Feng, Junlan
Format:	Preprint
Published:	2025
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2512.13251

Similar Items

Plugin Speech Enhancement: A Universal Speech Enhancement Framework Inspired by Dynamic Neural Network
by: Chen, Yanan, et al.
Published: (2024)

OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2026)

DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners
by: Luo, Xiaoxue, et al.
Published: (2025)

HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling
by: Si, Yuke, et al.
Published: (2025)

PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models
by: Yang, Runyan, et al.
Published: (2024)

On Calibration of Speech Classification Models: Insights from Energy-Based Model Investigations
by: Hao, Yaqian, et al.
Published: (2024)

RepCodec: A Speech Representation Codec for Speech Tokenization
by: Huang, Zhichao, et al.
Published: (2023)

MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition
by: Sun, Haiyang, et al.
Published: (2023)

GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model
by: Gao, Yingying, et al.
Published: (2024)

FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec
by: Halychanskyi, Yurii, et al.
Published: (2025)

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
by: Shi, Jiacheng, et al.
Published: (2026)

DisSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration
by: Liang, Ziqi, et al.
Published: (2026)

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
by: Ju, Zeqian, et al.
Published: (2024)

Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation
by: Yang, Runyan, et al.
Published: (2025)

AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ
by: Meng, Zhaoyang, et al.
Published: (2026)

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)

Fewer-token Neural Speech Codec with Time-invariant Codes
by: Ren, Yong, et al.
Published: (2023)

B-GRPO: Unsupervised Speech Emotion Recognition based on Batched-Group Relative Policy Optimization
by: Gao, Yingying, et al.
Published: (2026)

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
by: Wang, Yuancheng, et al.
Published: (2024)

CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
by: Kim, Jaehyeon, et al.
Published: (2024)

Personalized Neural Speech Codec
by: Jang, Inseon, et al.
Published: (2024)

SpatialCodec: Neural Spatial Speech Coding
by: Xu, Zhongweiyang, et al.
Published: (2023)

CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models
by: Chen, Junyang, et al.
Published: (2026)

A Neural Speech Codec for Noise Robust Speech Coding
by: Huang, Jiayi, et al.
Published: (2023)

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
by: Ji, Shengpeng, et al.
Published: (2023)

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)

U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
by: Yang, Xusheng, et al.
Published: (2025)

Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
by: Zhang, Leying, et al.
Published: (2025)

WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification
by: Zhou, Junzuo, et al.
Published: (2024)

SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models
by: Wang, Linqin, et al.
Published: (2024)

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
by: Liu, Jiaxuan, et al.
Published: (2024)

Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
by: Nespoli, Francesco, et al.
Published: (2024)

MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt
by: Wu, Zhichao, et al.
Published: (2025)

Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling
by: Karapiperis, Sotirios, et al.
Published: (2024)

HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
by: Nishimura, Yuto, et al.
Published: (2024)

FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)

Towards Audio Codec-based Speech Separation
by: Yip, Jia Qi, et al.
Published: (2024)

Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
by: Li, Jiaqi, et al.
Published: (2024)

BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis
by: Xing, Jingyuan, et al.
Published: (2025)

Zero-Shot Text-to-Speech for Vietnamese
by: Vu, Thi, et al.
Published: (2025)