:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Zhang, Ruonan, Mu, Lingzhou, Wu, Xixin, Zhang, Kai
Format:	Preprint
Published:	2025
Subjects:	Sound Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2509.17021
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy
by: Wu, Wenxuan, et al.
Published: (2024)

DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
by: Wang, Yuanyuan, et al.
Published: (2025)

SponTTS: modeling and transferring spontaneous style for TTS
by: Li, Hanzhao, et al.
Published: (2023)

Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets
by: Liu, Chenlin, et al.
Published: (2025)

MBCodec:Thorough disentangle for high-fidelity audio compression
by: Zhang, Ruonan, et al.
Published: (2025)

DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions
by: Chen, Weidong, et al.
Published: (2025)

E1 TTS: Simple and Fast Non-Autoregressive TTS
by: Liu, Zhijun, et al.
Published: (2024)

MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts
by: Xue, Heyang, et al.
Published: (2025)

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions
by: Wang, Yuanyuan, et al.
Published: (2024)

TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
by: Song, Xingchen, et al.
Published: (2024)

Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting
by: Park, Hyun Jin, et al.
Published: (2024)

SPAM: Style Prompt Adherence Metric for Prompt-based TTS
by: Cho, Chanhee, et al.
Published: (2026)

Accent-VITS:accent transfer for end-to-end TTS
by: Ma, Linhan, et al.
Published: (2023)

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
by: Eskimez, Sefik Emre, et al.
Published: (2024)

ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages
by: Qharabagh, Mahta Fetrat, et al.
Published: (2024)

Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
by: Guo, Haohan, et al.
Published: (2024)

Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study
by: An, Keyu, et al.
Published: (2024)

UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information
by: Wang, Rui, et al.
Published: (2025)

I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
by: Zhang, Jiawei, et al.
Published: (2024)

Differentiable Reward Optimization for LLM based TTS system
by: Gao, Changfeng, et al.
Published: (2025)

EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
by: Manku, Ruskin Raj, et al.
Published: (2025)

CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)

EE-TTS: Emphatic Expressive TTS with Linguistic Information
by: Zhong, Yi, et al.
Published: (2023)

The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024
by: Zhou, Shuoyi, et al.
Published: (2024)

F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
by: Sun, Xiaohui, et al.
Published: (2025)

DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
by: Wang, Jianzong, et al.
Published: (2023)

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
by: Anastassiou, Philip, et al.
Published: (2024)

FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
by: Guo, Hao-Han, et al.
Published: (2024)

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
by: Chen, Yushen, et al.
Published: (2024)

FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis
by: Guo, Yinlin, et al.
Published: (2024)

Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder
by: Guo, Haohan, et al.
Published: (2024)

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
by: Yang, Dongchao, et al.
Published: (2024)

HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
by: Nie, Sihang, et al.
Published: (2025)

Exploring synthetic data for cross-speaker style transfer in style representation based TTS
by: Ueda, Lucas H., et al.
Published: (2024)

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)

UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment
by: Wang, Yuanyuan, et al.
Published: (2026)

DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model
by: Chen, Xueyuan, et al.
Published: (2025)

Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT
by: Dai, Dongyang, et al.
Published: (2025)

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
by: Guan, Wenhao, et al.
Published: (2025)

MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis
by: Guan, Wenhao, et al.
Published: (2023)