| Main Authors: | Cui, Jiayan, Yang, Zhihan, Li, Naihan, Tian, Jiankun, Ma, Xingyu, Zhang, Yi, Chen, Guangyu, Yang, Runxuan, Cheng, Yuqing, Zhou, Yizhi, Yu, Guochen, Gu, Xiaotao, Tang, Jie |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.14291 |
Similar Items
Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation
by: Cheng, Yuqing, et al.
Published: (2026)
MOSS-TTS Technical Report
by: Gong, Yitian, et al.
Published: (2026)
Qwen3-TTS Technical Report
by: Hu, Hangrui, et al.
Published: (2026)
IndexTTS 2.5 Technical Report
by: Li, Yunpei, et al.
Published: (2026)
TTS-1 Technical Report
by: Atamanenko, Oleg, et al.
Published: (2025)
EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS
by: Li, Haoxun, et al.
Published: (2025)
Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
by: He, Jiaxu, et al.
Published: (2026)
QuarkAudio Technical Report
by: Liu, Chengwei, et al.
Published: (2025)
MOSS-Audio Technical Report
by: Yang, Chen, et al.
Published: (2026)
SponTTS: modeling and transferring spontaneous style for TTS
by: Li, Hanzhao, et al.
Published: (2023)
MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts
by: Xue, Heyang, et al.
Published: (2025)
GLM-OCR Technical Report
by: Duan, Shuaiqi, et al.
Published: (2026)
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
by: Zeng, Aohan, et al.
Published: (2024)
Enhancing Spectrogram Realism in Singing Voice Synthesis via Explicit Bandwidth Extension Prior to Vocoder
by: Yang, Runxuan, et al.
Published: (2025)
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
by: Eskimez, Sefik Emre, et al.
Published: (2024)
Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track
by: Yi, June Young, et al.
Published: (2025)
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis
by: Guan, Wenhao, et al.
Published: (2023)
MOSS Transcribe Diarize Technical Report
by: AI, MOSI., et al.
Published: (2026)
EE-TTS: Emphatic Expressive TTS with Linguistic Information
by: Zhong, Yi, et al.
Published: (2023)
Time-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation
by: Chen, Guo, et al.
Published: (2025)
SPMamba: State-space model is all you need in speech separation
by: Li, Kai, et al.
Published: (2024)
Accent-VITS: accent transfer for end-to-end TTS
by: Ma, Linhan, et al.
Published: (2023)
DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2025)
Index-ASR Technical Report
by: Song, Zheshu, et al.
Published: (2025)
A Fast and Lightweight Model for Causal Audio-Visual Speech Separation
by: Sang, Wendi, et al.
Published: (2025)
OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech
by: Ren, Yong, et al.
Published: (2026)
Step-Audio 2 Technical Report
by: Wu, Boyong, et al.
Published: (2025)
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)
SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech
by: Gan, Lu, et al.
Published: (2025)
Covo-Audio Technical Report
by: Wang, Wenfu, et al.
Published: (2026)
I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
by: Zhang, Jiawei, et al.
Published: (2024)
E1 TTS: Simple and Fast Non-Autoregressive TTS
by: Liu, Zhijun, et al.
Published: (2024)
TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
by: Song, Xingchen, et al.
Published: (2024)
IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)
PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion
by: Pankov, Vikentii, et al.
Published: (2026)
ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
by: Tang, Haobin, et al.
Published: (2024)
FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System
by: Guo, Hao-Han, et al.
Published: (2025)
FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions
by: Chen, Dekun, et al.
Published: (2026)
EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
by: Manku, Ruskin Raj, et al.
Published: (2025)
Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
by: Liu, Qingyu, et al.
Published: (2025)