:: Library Catalog

Bibliographic Details
Main Authors:	Cui, Jiayan, Yang, Zhihan, Li, Naihan, Tian, Jiankun, Ma, Xingyu, Zhang, Yi, Chen, Guangyu, Yang, Runxuan, Cheng, Yuqing, Zhou, Yizhi, Yu, Guochen, Gu, Xiaotao, Tang, Jie
Format:	Preprint
Published:	2025
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2512.14291

Similar Items

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation
by: Cheng, Yuqing, et al.
Published: (2026)

MOSS-TTS Technical Report
by: Gong, Yitian, et al.
Published: (2026)

Qwen3-TTS Technical Report
by: Hu, Hangrui, et al.
Published: (2026)

IndexTTS 2.5 Technical Report
by: Li, Yunpei, et al.
Published: (2026)

TTS-1 Technical Report
by: Atamanenko, Oleg, et al.
Published: (2025)

EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS
by: Li, Haoxun, et al.
Published: (2025)

Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
by: He, Jiaxu, et al.
Published: (2026)

QuarkAudio Technical Report
by: Liu, Chengwei, et al.
Published: (2025)

MOSS-Audio Technical Report
by: Yang, Chen, et al.
Published: (2026)

SponTTS: modeling and transferring spontaneous style for TTS
by: Li, Hanzhao, et al.
Published: (2023)

MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts
by: Xue, Heyang, et al.
Published: (2025)

GLM-OCR Technical Report
by: Duan, Shuaiqi, et al.
Published: (2026)

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
by: Zeng, Aohan, et al.
Published: (2024)

Enhancing Spectrogram Realism in Singing Voice Synthesis via Explicit Bandwidth Extension Prior to Vocoder
by: Yang, Runxuan, et al.
Published: (2025)

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
by: Eskimez, Sefik Emre, et al.
Published: (2024)

Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track
by: Yi, June Young, et al.
Published: (2025)

MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis
by: Guan, Wenhao, et al.
Published: (2023)

MOSS Transcribe Diarize Technical Report
by: AI, MOSI., et al.
Published: (2026)

EE-TTS: Emphatic Expressive TTS with Linguistic Information
by: Zhong, Yi, et al.
Published: (2023)

Time-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation
by: Chen, Guo, et al.
Published: (2025)

SPMamba: State-space model is all you need in speech separation
by: Li, Kai, et al.
Published: (2024)

Accent-VITS: accent transfer for end-to-end TTS
by: Ma, Linhan, et al.
Published: (2023)

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2025)

Index-ASR Technical Report
by: Song, Zheshu, et al.
Published: (2025)

A Fast and Lightweight Model for Causal Audio-Visual Speech Separation
by: Sang, Wendi, et al.
Published: (2025)

OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech
by: Ren, Yong, et al.
Published: (2026)

Step-Audio 2 Technical Report
by: Wu, Boyong, et al.
Published: (2025)

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)

SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech
by: Gan, Lu, et al.
Published: (2025)

Covo-Audio Technical Report
by: Wang, Wenfu, et al.
Published: (2026)

I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
by: Zhang, Jiawei, et al.
Published: (2024)

E1 TTS: Simple and Fast Non-Autoregressive TTS
by: Liu, Zhijun, et al.
Published: (2024)

TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
by: Song, Xingchen, et al.
Published: (2024)

IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)

PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion
by: Pankov, Vikentii, et al.
Published: (2026)

ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
by: Tang, Haobin, et al.
Published: (2024)

FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System
by: Guo, Hao-Han, et al.
Published: (2025)

FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions
by: Chen, Dekun, et al.
Published: (2026)

EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
by: Manku, Ruskin Raj, et al.
Published: (2025)

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
by: Liu, Qingyu, et al.
Published: (2025)