:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Chen, Yang, Wang, Hui, Wang, Shiyao, Chen, Junyang, He, Jiabei, Zhou, Jiaming, Yang, Xi, Wang, Yequan, Lin, Yonghua, Qin, Yong
Format:	Preprint
Published:	2025
Subjects:	Computation and Language Sound Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2503.16578
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
by: Zhou, Jiaming, et al.
Published: (2024)

CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition
by: Zhou, Jiaming, et al.
Published: (2025)

WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations
by: Wang, Hui, et al.
Published: (2025)

A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition
by: Wang, Shiyao, et al.
Published: (2025)

PB-LRDWWS System for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge
by: Wang, Shiyao, et al.
Published: (2024)

EchoVoices: Preserving Generational Voices and Memories for Seniors and Children
by: Xu, Haiying, et al.
Published: (2025)

M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper
by: Zhou, Jiaming, et al.
Published: (2024)

CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models
by: Chen, Junyang, et al.
Published: (2026)

Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation
by: Wang, Shiyao, et al.
Published: (2024)

AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation
by: Wang, Hui, et al.
Published: (2025)

TTA-Bench: A Comprehensive Benchmark for Evaluating Text-to-Audio Models
by: Wang, Hui, et al.
Published: (2025)

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
by: Wang, Hui, et al.
Published: (2025)

RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval
by: Sun, Haoqin, et al.
Published: (2025)

RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization
by: Yang, Bing, et al.
Published: (2024)

MusicEval: A Generative Music Dataset with Expert Ratings for Automatic Text-to-Music Evaluation
by: Liu, Cheng, et al.
Published: (2025)

Interpretable Audio Editing Evaluation via Chain-of-Thought Difference-Commonality Reasoning with Multimodal LLMs
by: Jia, Yuhang, et al.
Published: (2025)

kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels
by: Zhou, Jiaming, et al.
Published: (2023)

Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores
by: Zhou, Jiaming, et al.
Published: (2024)

Iterative Prototype Refinement for Ambiguous Speech Emotion Recognition
by: Sun, Haoqin, et al.
Published: (2024)

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations
by: Liu, Sen, et al.
Published: (2024)

Cross-Talk Speech Reduction, by Separation, for Separation
by: Wang, Zhong-Qiu, et al.
Published: (2026)

StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2024)

Generating Novel and Realistic Speakers for Voice Conversion
by: Chen, Meiying Melissa, et al.
Published: (2025)

AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework
by: Jia, Yuhang, et al.
Published: (2024)

DIFFA: Large Language Diffusion Models Can Listen and Understand
by: Zhou, Jiaming, et al.
Published: (2025)

StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2024)

Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment
by: Wang, Xuechen, et al.
Published: (2024)

Summary on The Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods
by: Mu, Bingshen, et al.
Published: (2025)

ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement
by: Wang, Zhong-Qiu
Published: (2024)

AFT: An Exemplar-Free Class Incremental Learning Method for Environmental Sound Classification
by: Chen, Xinyi, et al.
Published: (2025)

LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation
by: Guan, Wenhao, et al.
Published: (2024)

Multi-level Temporal-channel Speaker Retrieval for Zero-shot Voice Conversion
by: Wang, Zhichao, et al.
Published: (2023)

Combined Generative and Predictive Modeling for Speech Super-resolution
by: Wang, Heming, et al.
Published: (2024)

DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations
by: Peng, Ziqiao, et al.
Published: (2025)

Uncertainty-Aware Mean Opinion Score Prediction
by: Wang, Hui, et al.
Published: (2024)

Residual Speaker Representation for One-Shot Voice Conversion
by: Xu, Le, et al.
Published: (2023)

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion
by: Li, Yuke, et al.
Published: (2024)

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
by: Yang, Qian, et al.
Published: (2024)

ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
by: Zhu, Xinfa, et al.
Published: (2025)

Cross-Talk Reduction
by: Wang, Zhong-Qiu, et al.
Published: (2024)