| Main Authors: | Zhang, Ruonan; Mu, Lingzhou; Wu, Xixin; Zhang, Kai |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.17021 |
Similar Items
Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy
by: Wu, Wenxuan, et al.
Published: (2024)
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
by: Wang, Yuanyuan, et al.
Published: (2025)
SponTTS: modeling and transferring spontaneous style for TTS
by: Li, Hanzhao, et al.
Published: (2023)
Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets
by: Liu, Chenlin, et al.
Published: (2025)
MBCodec:Thorough disentangle for high-fidelity audio compression
by: Zhang, Ruonan, et al.
Published: (2025)
DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions
by: Chen, Weidong, et al.
Published: (2025)
E1 TTS: Simple and Fast Non-Autoregressive TTS
by: Liu, Zhijun, et al.
Published: (2024)
MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts
by: Xue, Heyang, et al.
Published: (2025)
AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions
by: Wang, Yuanyuan, et al.
Published: (2024)
TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
by: Song, Xingchen, et al.
Published: (2024)
Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting
by: Park, Hyun Jin, et al.
Published: (2024)
SPAM: Style Prompt Adherence Metric for Prompt-based TTS
by: Cho, Chanhee, et al.
Published: (2026)
Accent-VITS:accent transfer for end-to-end TTS
by: Ma, Linhan, et al.
Published: (2023)
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
by: Eskimez, Sefik Emre, et al.
Published: (2024)
ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages
by: Qharabagh, Mahta Fetrat, et al.
Published: (2024)
Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
by: Guo, Haohan, et al.
Published: (2024)
Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study
by: An, Keyu, et al.
Published: (2024)
UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information
by: Wang, Rui, et al.
Published: (2025)
I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
by: Zhang, Jiawei, et al.
Published: (2024)
Differentiable Reward Optimization for LLM based TTS system
by: Gao, Changfeng, et al.
Published: (2025)
EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
by: Manku, Ruskin Raj, et al.
Published: (2025)
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)
EE-TTS: Emphatic Expressive TTS with Linguistic Information
by: Zhong, Yi, et al.
Published: (2023)
The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024
by: Zhou, Shuoyi, et al.
Published: (2024)
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
by: Sun, Xiaohui, et al.
Published: (2025)
DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
by: Wang, Jianzong, et al.
Published: (2023)
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
by: Anastassiou, Philip, et al.
Published: (2024)
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
by: Guo, Hao-Han, et al.
Published: (2024)
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
by: Chen, Yushen, et al.
Published: (2024)
FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis
by: Guo, Yinlin, et al.
Published: (2024)
Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder
by: Guo, Haohan, et al.
Published: (2024)
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
by: Yang, Dongchao, et al.
Published: (2024)
HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
by: Nie, Sihang, et al.
Published: (2025)
Exploring synthetic data for cross-speaker style transfer in style representation based TTS
by: Ueda, Lucas H., et al.
Published: (2024)
Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)
UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment
by: Wang, Yuanyuan, et al.
Published: (2026)
DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model
by: Chen, Xueyuan, et al.
Published: (2025)
Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT
by: Dai, Dongyang, et al.
Published: (2025)
UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
by: Guan, Wenhao, et al.
Published: (2025)
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis
by: Guan, Wenhao, et al.
Published: (2023)