Library Catalog

Bibliographic Details
Main Authors:	Li, Hanzhao; Li, Yuke; Wang, Xinsheng; Hu, Jingbin; Xie, Qicong; Yang, Shan; Xie, Lei
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Sound
Online Access:	https://arxiv.org/abs/2501.04644

Similar Items

DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
by: Xie, Hanke, et al.
Published: (2025)

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning
by: Zhu, Xinfa, et al.
Published: (2023)

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
by: Chen, Huakang, et al.
Published: (2026)

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study
by: Chen, Peikun, et al.
Published: (2024)

SCDNet: Self-supervised Learning Feature-based Speaker Change Detection
by: Li, Yue, et al.
Published: (2024)

WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem
by: Wang, Chengyou, et al.
Published: (2026)

DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions
by: Chen, Weidong, et al.
Published: (2025)

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)

EDSep: An Effective Diffusion-Based Method for Speech Source Separation
by: Dong, Jinwei, et al.
Published: (2025)

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition
by: Li, Yangze, et al.
Published: (2024)

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion
by: Li, Yuke, et al.
Published: (2024)

FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
by: Xie, Kun, et al.
Published: (2025)

SponTTS: modeling and transferring spontaneous style for TTS
by: Li, Hanzhao, et al.
Published: (2023)

Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification
by: Zhang, Li, et al.
Published: (2025)

The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge
by: Lin, Yuke, et al.
Published: (2025)

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
by: Jiang, Yuepeng, et al.
Published: (2024)

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
by: Zhao, Zhixian, et al.
Published: (2025)

SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
by: Chen, Wenxi, et al.
Published: (2025)

FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
by: Guo, Hao-Han, et al.
Published: (2024)

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
by: Du, Chenpeng, et al.
Published: (2024)

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
by: Xie, Tianxin, et al.
Published: (2025)

Coarse-to-fine Alignment Makes Better Speech-image Retrieval
by: Zhou, Lifeng, et al.
Published: (2024)

MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow
by: Zhu, Yike, et al.
Published: (2025)

GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling
by: Yao, Jixun, et al.
Published: (2025)

CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions
by: Zhu, Xinfa, et al.
Published: (2025)

Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition
by: Jiang, Yicong, et al.
Published: (2024)

StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2024)

StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2024)

Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis
by: Tian, Wenjie, et al.
Published: (2025)

Multi-Stage Speech Bandwidth Extension with Flexible Sampling Rate Control
by: Lu, Ye-Xin, et al.
Published: (2024)

Acoustic BPE for Speech Generation with Discrete Tokens
by: Shen, Feiyu, et al.
Published: (2023)

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis
by: Guo, Haohan, et al.
Published: (2024)

AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition
by: Dai, Yuhang, et al.
Published: (2025)

MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech
by: Xia, Kangxiang, et al.
Published: (2025)

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
by: Shao, Mingchen, et al.
Published: (2025)

Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
by: Shan, Weiqiao, et al.
Published: (2025)

Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition
by: Mu, Bingshen, et al.
Published: (2025)

MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition
by: Mu, Bingshen, et al.
Published: (2024)

CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition
by: Wang, He, et al.
Published: (2024)

EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs
by: Tian, Wenjie, et al.
Published: (2026)