Saved in:
| Main Authors: | Meng, Qingliang, Deng, Yuqing, Liang, Wei, Yu, Limei, Liang, Huizhi, Li, Tian |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2508.12001 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems
by: Meng, Qingliang, et al.
Published: (2025)
by: Meng, Qingliang, et al.
Published: (2025)
MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts
by: Xue, Heyang, et al.
Published: (2025)
by: Xue, Heyang, et al.
Published: (2025)
VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
by: Peng, Puyuan, et al.
Published: (2025)
by: Peng, Puyuan, et al.
Published: (2025)
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
by: Kim, Semin, et al.
Published: (2026)
by: Kim, Semin, et al.
Published: (2026)
Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
by: Eskimez, Sefik Emre, et al.
Published: (2024)
by: Eskimez, Sefik Emre, et al.
Published: (2024)
MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition
by: Deng, Chengxi, et al.
Published: (2025)
by: Deng, Chengxi, et al.
Published: (2025)
UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition
by: Fu, Li, et al.
Published: (2024)
by: Fu, Li, et al.
Published: (2024)
EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis
by: Li, Haoxun, et al.
Published: (2025)
by: Li, Haoxun, et al.
Published: (2025)
DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2025)
by: Lu, Ye-Xin, et al.
Published: (2025)
Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech
by: Li, Zirui, et al.
Published: (2025)
by: Li, Zirui, et al.
Published: (2025)
WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
by: Ma, Linhan, et al.
Published: (2024)
by: Ma, Linhan, et al.
Published: (2024)
KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis
by: Abilbekov, Adal, et al.
Published: (2024)
by: Abilbekov, Adal, et al.
Published: (2024)
Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)
by: Yang, Yifan, et al.
Published: (2024)
DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis
by: Li, Yinghao Aaron, et al.
Published: (2025)
by: Li, Yinghao Aaron, et al.
Published: (2025)
SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech
by: Lin, Jingru, et al.
Published: (2024)
by: Lin, Jingru, et al.
Published: (2024)
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)
by: Jiang, Ziyue, et al.
Published: (2023)
Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation
by: Di, Xinhan, et al.
Published: (2024)
by: Di, Xinhan, et al.
Published: (2024)
TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer
by: Bataev, Vladimir, et al.
Published: (2025)
by: Bataev, Vladimir, et al.
Published: (2025)
Adaptive Mixture of Low-Rank Experts for Robust Audio Spoofing Detection
by: Chen, Qixian, et al.
Published: (2025)
by: Chen, Qixian, et al.
Published: (2025)
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis
by: Guan, Wenhao, et al.
Published: (2023)
by: Guan, Wenhao, et al.
Published: (2023)
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
by: Anastassiou, Philip, et al.
Published: (2024)
by: Anastassiou, Philip, et al.
Published: (2024)
DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation
by: Chen, Ziqi, et al.
Published: (2025)
by: Chen, Ziqi, et al.
Published: (2025)
IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
by: Zhou, Siyi, et al.
Published: (2025)
by: Zhou, Siyi, et al.
Published: (2025)
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech
by: Kim, Jaehyeon, et al.
Published: (2024)
by: Kim, Jaehyeon, et al.
Published: (2024)
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
by: Chen, Yushen, et al.
Published: (2024)
by: Chen, Yushen, et al.
Published: (2024)
ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
by: Gong, Cheng, et al.
Published: (2023)
by: Gong, Cheng, et al.
Published: (2023)
FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis
by: Guo, Yinlin, et al.
Published: (2024)
by: Guo, Yinlin, et al.
Published: (2024)
Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR
by: Wang, Zhong-Qiu, et al.
Published: (2025)
by: Wang, Zhong-Qiu, et al.
Published: (2025)
MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis
by: Singh, Jaskaran, et al.
Published: (2025)
by: Singh, Jaskaran, et al.
Published: (2025)
Enhancing In-the-Wild Speech Emotion Conversion with Resynthesis-based Duration Modeling
by: Prabhu, Navin Raj, et al.
Published: (2025)
by: Prabhu, Navin Raj, et al.
Published: (2025)
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
by: Wang, Xi, et al.
Published: (2026)
by: Wang, Xi, et al.
Published: (2026)
EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech
by: Liang, Ziqi, et al.
Published: (2024)
by: Liang, Ziqi, et al.
Published: (2024)
ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
by: Tang, Haobin, et al.
Published: (2024)
by: Tang, Haobin, et al.
Published: (2024)
Traceable TTS: Toward Watermark-Free TTS with Strong Traceability
by: Zhao, Yuxiang, et al.
Published: (2025)
by: Zhao, Yuxiang, et al.
Published: (2025)
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis
by: Wang, Helin, et al.
Published: (2024)
by: Wang, Helin, et al.
Published: (2024)
Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages
by: Pandey, Isha, et al.
Published: (2025)
by: Pandey, Isha, et al.
Published: (2025)
Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification
by: Gu, Bin, et al.
Published: (2025)
by: Gu, Bin, et al.
Published: (2025)
HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset
by: Langman, Ryan, et al.
Published: (2025)
by: Langman, Ryan, et al.
Published: (2025)
Construction and Evaluation of Mandarin Multimodal Emotional Speech Database
by: Ting, Zhu, et al.
Published: (2024)
by: Ting, Zhu, et al.
Published: (2024)
Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition
by: Tzeng, Jing-Tong, et al.
Published: (2025)
by: Tzeng, Jing-Tong, et al.
Published: (2025)
Similar Items
-
MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems
by: Meng, Qingliang, et al.
Published: (2025) -
MoE-TTS: Enhancing Out-of-Domain Text Understanding for Description-based TTS via Mixture-of-Experts
by: Xue, Heyang, et al.
Published: (2025) -
VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
by: Peng, Puyuan, et al.
Published: (2025) -
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
by: Kim, Semin, et al.
Published: (2026) -
Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
by: Eskimez, Sefik Emre, et al.
Published: (2024)