:: Library Catalog

Bibliographic Details
Main Authors:	Meng, Qingliang; Deng, Yuqing; Liang, Wei; Yu, Limei; Liang, Huizhi; Li, Tian
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2508.12001

Similar Items

MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems
by: Meng, Qingliang, et al.
Published: (2025)

MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts
by: Xue, Heyang, et al.
Published: (2025)

VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
by: Peng, Puyuan, et al.
Published: (2025)

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
by: Kim, Semin, et al.
Published: (2026)

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
by: Eskimez, Sefik Emre, et al.
Published: (2024)

MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition
by: Deng, Chengxi, et al.
Published: (2025)

UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition
by: Fu, Li, et al.
Published: (2024)

EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis
by: Li, Haoxun, et al.
Published: (2025)

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2025)

Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech
by: Li, Zirui, et al.
Published: (2025)

WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
by: Ma, Linhan, et al.
Published: (2024)

KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis
by: Abilbekov, Adal, et al.
Published: (2024)

Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)

DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis
by: Li, Yinghao Aaron, et al.
Published: (2025)

SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech
by: Lin, Jingru, et al.
Published: (2024)

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)

Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation
by: Di, Xinhan, et al.
Published: (2024)

TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer
by: Bataev, Vladimir, et al.
Published: (2025)

Adaptive Mixture of Low-Rank Experts for Robust Audio Spoofing Detection
by: Chen, Qixian, et al.
Published: (2025)

MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis
by: Guan, Wenhao, et al.
Published: (2023)

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
by: Anastassiou, Philip, et al.
Published: (2024)

DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation
by: Chen, Ziqi, et al.
Published: (2025)

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
by: Zhou, Siyi, et al.
Published: (2025)

CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
by: Kim, Jaehyeon, et al.
Published: (2024)

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
by: Chen, Yushen, et al.
Published: (2024)

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
by: Gong, Cheng, et al.
Published: (2023)

FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis
by: Guo, Yinlin, et al.
Published: (2024)

Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR
by: Wang, Zhong-Qiu, et al.
Published: (2025)

MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis
by: Singh, Jaskaran, et al.
Published: (2025)

Enhancing In-the-Wild Speech Emotion Conversion with Resynthesis-based Duration Modeling
by: Prabhu, Navin Raj, et al.
Published: (2025)

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
by: Wang, Xi, et al.
Published: (2026)

EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech
by: Liang, Ziqi, et al.
Published: (2024)

ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
by: Tang, Haobin, et al.
Published: (2024)

Traceable TTS: Toward Watermark-Free TTS with Strong Traceability
by: Zhao, Yuxiang, et al.
Published: (2025)

SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis
by: Wang, Helin, et al.
Published: (2024)

Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages
by: Pandey, Isha, et al.
Published: (2025)

Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification
by: Gu, Bin, et al.
Published: (2025)

HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset
by: Langman, Ryan, et al.
Published: (2025)

Construction and Evaluation of Mandarin Multimodal Emotional Speech Database
by: Ting, Zhu, et al.
Published: (2024)

Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition
by: Tzeng, Jing-Tong, et al.
Published: (2025)