:: Library Catalog

Bibliographic Details
Main Authors:	Yu, Xinyue; Fang, Youqing; Wu, Pingyu; Ye, Guoyang; Zhou, Wenbo; Zhang, Weiming; Xiao, Song
Format:	Preprint
Published:	2025
Subjects:	Sound; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.12074

Similar Items

DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec
by: Li, Tao, et al.
Published: (2025)

A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding
by: Ye, Runchuan, et al.
Published: (2025)

Fine-Grained Quantitative Emotion Editing for Speech Generation
by: Inoue, Sho, et al.
Published: (2024)

VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
by: Zhou, Yixuan, et al.
Published: (2025)

Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition
by: Shen, Siyuan, et al.
Published: (2024)

MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control
by: Mai, Jialong, et al.
Published: (2026)

Factorized RVQ-GAN For Disentangled Speech Tokenization
by: Khurana, Sameer, et al.
Published: (2025)

Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis
by: Wang, Haoshen, et al.
Published: (2026)

Fine-Grained and Interpretable Neural Speech Editing
by: Morrison, Max, et al.
Published: (2024)

Speech-to-See: End-to-End Speech-Driven Open-Set Object Detection
by: Lu, Wenhuan, et al.
Published: (2025)

Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation
by: Deng, Yimin, et al.
Published: (2024)

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
by: Xie, Tianxin, et al.
Published: (2025)

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
by: Song, Yuhan, et al.
Published: (2025)

Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition
by: Wagner, Dominik, et al.
Published: (2025)

DisSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration
by: Liang, Ziqi, et al.
Published: (2026)

DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance
by: Yin, Kang, et al.
Published: (2025)

Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech
by: Yao, Jixun, et al.
Published: (2025)

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2025)

Fine-Grained Frame Modeling in Multi-head Self-Attention for Speech Deepfake Detection
by: Phuong, Tuan Dat, et al.
Published: (2026)

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
by: Shi, Jiacheng, et al.
Published: (2026)

Learning Disentangled Speech Representations
by: Brima, Yusuf, et al.
Published: (2023)

Multi-Scale Accent Modeling and Disentangling for Multi-Speaker Multi-Accent Text-to-Speech Synthesis
by: Zhou, Xuehao, et al.
Published: (2024)

Speaker-Disentangled Remote Speech Detection of Asthma and COPD Exacerbations
by: Yan, Yuyang, et al.
Published: (2026)

AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis
by: Qi, Tianhua, et al.
Published: (2026)

Efficient Long-Form Speech Recognition for General Speech In-Context Learning
by: Yen, Hao, et al.
Published: (2024)

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
by: Ji, Shengpeng, et al.
Published: (2023)

FleSpeech: Flexibly Controllable Speech Generation with Various Prompts
by: Li, Hanzhao, et al.
Published: (2025)

Speaker Disentanglement of Speech Pre-trained Model Based on Interpretability
by: Zhu, Xiaoxu, et al.
Published: (2025)

Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
by: Deng, Yimin, et al.
Published: (2024)

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
by: Wu, Yihan, et al.
Published: (2024)

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?
by: Fang, Qingkai, et al.
Published: (2024)

Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance
by: Ochiai, Tsubasa, et al.
Published: (2024)

Koopman Regularized Deep Speech Disentanglement for Speaker Verification
by: Chazaridis, Nikos, et al.
Published: (2026)

DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
by: Wang, Yuanyuan, et al.
Published: (2025)

WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations
by: Wang, Hui, et al.
Published: (2025)

Unified Architecture and Unsupervised Speech Disentanglement for Speaker Embedding-Free Enrollment in Personalized Speech Enhancement
by: Huang, Ziling, et al.
Published: (2025)

SLM-SS: Speech Language Model for Generative Speech Separation
by: Li, Tianhua, et al.
Published: (2026)

WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation
by: Fang, Zihao, et al.
Published: (2026)

Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
by: Li, Jiaqi, et al.
Published: (2024)

GSRM: Generative Speech Reward Model for Speech RLHF
by: Shen, Maohao, et al.
Published: (2026)