Library Catalog

Bibliographic Details
Main Authors:	Zhang, Leying; Qian, Yao; Yu, Linfeng; Wang, Heming; Yang, Hemin; Zhou, Long; Liu, Shujie; Qian, Yanmin
Format:	Preprint
Published:	2023
Subjects:	Audio and Speech Processing; Machine Learning; Sound
Online Access:	https://arxiv.org/abs/2309.13874

Similar Items

Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
by: Zhang, Leying, et al.
Published: (2025)

Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling
by: Zhang, Leying, et al.
Published: (2024)

Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement
by: Li, Chenda, et al.
Published: (2024)

DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice
by: Zhang, Leying, et al.
Published: (2026)

SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation
by: Lu, Haitian, et al.
Published: (2025)

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
by: Zhang, Leying, et al.
Published: (2026)

Advanced Long-Content Speech Recognition With Factorized Neural Transducer
by: Gong, Xun, et al.
Published: (2024)

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
by: Zhang, Leying, et al.
Published: (2024)

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
by: Le, Chenyang, et al.
Published: (2024)

From Sharpness to Better Generalization for Speech Deepfake Detection
by: Huang, Wen, et al.
Published: (2025)

WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction
by: Wang, Shuai, et al.
Published: (2024)

Improving Design of Input Condition Invariant Speech Enhancement
by: Zhang, Wangyou, et al.
Published: (2024)

SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods
by: Huang, Wen, et al.
Published: (2025)

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems
by: Chen, Zhengyang, et al.
Published: (2024)

A Unified Neural Codec Language Model for Selective Editable Text to Speech Generation
by: Pei, Hanchen, et al.
Published: (2026)

Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
by: Li, Jiaqi, et al.
Published: (2024)

Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization
by: Liu, Bei, et al.
Published: (2024)

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching
by: Zhang, Leying, et al.
Published: (2025)

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
by: Chen, Sanyuan, et al.
Published: (2024)

Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching
by: Chen, Zhengyang, et al.
Published: (2024)

USED: Universal Speaker Extraction and Diarization
by: Ao, Junyi, et al.
Published: (2023)

Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models
by: Zeng, Bang, et al.
Published: (2026)

Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment
by: Wang, Wei, et al.
Published: (2025)

Boosting Large Language Model for Speech Synthesis: An Empirical Study
by: Hao, Hongkun, et al.
Published: (2023)

Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification
by: Liu, Bei, et al.
Published: (2024)

Text adaptation for speaker verification with speaker-text factorized embeddings
by: Yang, Yexin, et al.
Published: (2025)

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement
by: Zhang, Wangyou, et al.
Published: (2024)

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings
by: Zhao, He, et al.
Published: (2024)

Toward Universal Speech Enhancement for Diverse Input Conditions
by: Zhang, Wangyou, et al.
Published: (2023)

BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM
by: Gong, Xun, et al.
Published: (2025)

LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
by: Xin, Detai, et al.
Published: (2026)

Probing Self-supervised Learning Models with Target Speech Extraction
by: Peng, Junyi, et al.
Published: (2024)

DQ-Whisper: Joint Distillation and Quantization for Efficient Multilingual Speech Recognition
by: Shao, Hang, et al.
Published: (2023)

Combined Generative and Predictive Modeling for Speech Super-resolution
by: Wang, Heming, et al.
Published: (2024)

SpatialCodec: Neural Spatial Speech Coding
by: Xu, Zhongweiyang, et al.
Published: (2023)

Beyond Speaker Identity: Text Guided Target Speech Extraction
by: Huo, Mingyue, et al.
Published: (2025)

Target Speech Extraction with Pre-trained Self-supervised Learning Models
by: Peng, Junyi, et al.
Published: (2024)

Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion
by: Chen, Zhengyang, et al.
Published: (2024)

Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios
by: Yang, Yiming, et al.
Published: (2026)

DENSE: Dynamic Embedding Causal Target Speech Extraction
by: Wang, Yiwen, et al.
Published: (2024)