| Main Authors: | Yu, Xinyue; Fang, Youqing; Wu, Pingyu; Ye, Guoyang; Zhou, Wenbo; Zhang, Weiming; Xiao, Song |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.12074 |
Similar Items
DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec
by: Li, Tao, et al.
Published: (2025)
A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding
by: Ye, Runchuan, et al.
Published: (2025)
Fine-Grained Quantitative Emotion Editing for Speech Generation
by: Inoue, Sho, et al.
Published: (2024)
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
by: Zhou, Yixuan, et al.
Published: (2025)
Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition
by: Shen, Siyuan, et al.
Published: (2024)
MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control
by: Mai, Jialong, et al.
Published: (2026)
Factorized RVQ-GAN For Disentangled Speech Tokenization
by: Khurana, Sameer, et al.
Published: (2025)
Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis
by: Wang, Haoshen, et al.
Published: (2026)
Fine-Grained and Interpretable Neural Speech Editing
by: Morrison, Max, et al.
Published: (2024)
Speech-to-See: End-to-End Speech-Driven Open-Set Object Detection
by: Lu, Wenhuan, et al.
Published: (2025)
Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation
by: Deng, Yimin, et al.
Published: (2024)
EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
by: Xie, Tianxin, et al.
Published: (2025)
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
by: Song, Yuhan, et al.
Published: (2025)
Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition
by: Wagner, Dominik, et al.
Published: (2025)
DisSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration
by: Liang, Ziqi, et al.
Published: (2026)
DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance
by: Yin, Kang, et al.
Published: (2025)
Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech
by: Yao, Jixun, et al.
Published: (2025)
DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2025)
Fine-Grained Frame Modeling in Multi-head Self-Attention for Speech Deepfake Detection
by: Phuong, Tuan Dat, et al.
Published: (2026)
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
by: Shi, Jiacheng, et al.
Published: (2026)
Learning Disentangled Speech Representations
by: Brima, Yusuf, et al.
Published: (2023)
Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis
by: Zhou, Xuehao, et al.
Published: (2024)
Speaker-Disentangled Remote Speech Detection of Asthma and COPD Exacerbations
by: Yan, Yuyang, et al.
Published: (2026)
AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis
by: Qi, Tianhua, et al.
Published: (2026)
Efficient Long-Form Speech Recognition for General Speech In-Context Learning
by: Yen, Hao, et al.
Published: (2024)
TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
by: Ji, Shengpeng, et al.
Published: (2023)
FleSpeech: Flexibly Controllable Speech Generation with Various Prompts
by: Li, Hanzhao, et al.
Published: (2025)
Speaker Disentanglement of Speech Pre-trained Model Based on Interpretability
by: Zhu, Xiaoxu, et al.
Published: (2025)
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
by: Deng, Yimin, et al.
Published: (2024)
SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
by: Wu, Yihan, et al.
Published: (2024)
Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?
by: Fang, Qingkai, et al.
Published: (2024)
Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance
by: Ochiai, Tsubasa, et al.
Published: (2024)
Koopman Regularized Deep Speech Disentanglement for Speaker Verification
by: Chazaridis, Nikos, et al.
Published: (2026)
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
by: Wang, Yuanyuan, et al.
Published: (2025)
WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations
by: Wang, Hui, et al.
Published: (2025)
Unified Architecture and Unsupervised Speech Disentanglement for Speaker Embedding-Free Enrollment in Personalized Speech Enhancement
by: Huang, Ziling, et al.
Published: (2025)
SLM-SS: Speech Language Model for Generative Speech Separation
by: Li, Tianhua, et al.
Published: (2026)
WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation
by: Fang, Zihao, et al.
Published: (2026)
Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
by: Li, Jiaqi, et al.
Published: (2024)
GSRM: Generative Speech Reward Model for Speech RLHF
by: Shen, Maohao, et al.
Published: (2026)