| Main Authors: | Li, Hanzhao, Li, Yuke, Wang, Xinsheng, Hu, Jingbin, Xie, Qicong, Yang, Shan, Xie, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2501.04644 |
Similar Items
DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
by: Xie, Hanke, et al.
Published: (2025)
Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning
by: Zhu, Xinfa, et al.
Published: (2023)
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
by: Chen, Huakang, et al.
Published: (2026)
Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study
by: Chen, Peikun, et al.
Published: (2024)
SCDNet: Self-supervised Learning Feature-based Speaker Change Detection
by: Li, Yue, et al.
Published: (2024)
WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem
by: Wang, Chengyou, et al.
Published: (2026)
DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions
by: Chen, Weidong, et al.
Published: (2025)
Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)
EDSep: An Effective Diffusion-Based Method for Speech Source Separation
by: Dong, Jinwei, et al.
Published: (2025)
A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition
by: Li, Yangze, et al.
Published: (2024)
CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion
by: Li, Yuke, et al.
Published: (2024)
FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
by: Xie, Kun, et al.
Published: (2025)
SponTTS: modeling and transferring spontaneous style for TTS
by: Li, Hanzhao, et al.
Published: (2023)
Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification
by: Zhang, Li, et al.
Published: (2025)
The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge
by: Lin, Yuke, et al.
Published: (2025)
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
by: Jiang, Yuepeng, et al.
Published: (2024)
Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
by: Zhao, Zhixian, et al.
Published: (2025)
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
by: Chen, Wenxi, et al.
Published: (2025)
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
by: Guo, Hao-Han, et al.
Published: (2024)
VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
by: Du, Chenpeng, et al.
Published: (2024)
EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
by: Xie, Tianxin, et al.
Published: (2025)
Coarse-to-fine Alignment Makes Better Speech-image Retrieval
by: Zhou, Lifeng, et al.
Published: (2024)
MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow
by: Zhu, Yike, et al.
Published: (2025)
GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling
by: Yao, Jixun, et al.
Published: (2025)
CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions
by: Zhu, Xinfa, et al.
Published: (2025)
Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition
by: Jiang, Yicong, et al.
Published: (2024)
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2024)
StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2024)
Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis
by: Tian, Wenjie, et al.
Published: (2025)
Multi-Stage Speech Bandwidth Extension with Flexible Sampling Rate Control
by: Lu, Ye-Xin, et al.
Published: (2024)
Acoustic BPE for Speech Generation with Discrete Tokens
by: Shen, Feiyu, et al.
Published: (2023)
SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis
by: Guo, Haohan, et al.
Published: (2024)
AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition
by: Dai, Yuhang, et al.
Published: (2025)
MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech
by: Xia, Kangxiang, et al.
Published: (2025)
Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
by: Shao, Mingchen, et al.
Published: (2025)
Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
by: Shan, Weiqiao, et al.
Published: (2025)
Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition
by: Mu, Bingshen, et al.
Published: (2025)
MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition
by: Mu, Bingshen, et al.
Published: (2024)
CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition
by: Wang, He, et al.
Published: (2024)
EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs
by: Tian, Wenjie, et al.
Published: (2026)