:: Library Catalog

Bibliographic Details
Main Authors:	Song, Jiaxin; Wang, Xinyu; Wang, Yihao; Tang, Yifan; Zhang, Ru; Liu, Jianyi; Liu, Gongshen
Format:	Preprint
Published:	2025
Subjects:	Sound; Computation and Language; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2501.00907

Similar Items

Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models
by: Zezario, Ryandhimas E., et al.
Published: (2026)

The Universal Personalizer: Few-Shot Dysarthric Speech Recognition via Meta-Learning
by: Agarwal, Dhruuv, et al.
Published: (2025)

RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)

SaD: A Scenario-Aware Discriminator for Speech Enhancement
by: Yuan, Xihao, et al.
Published: (2025)

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing
by: Anastassiou, Philip, et al.
Published: (2024)

Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models
by: Li, Zhaoqing, et al.
Published: (2025)

Leveraging Multimodal Methods and Spontaneous Speech for Alzheimer's Disease Identification
by: Gao, Yifan, et al.
Published: (2024)

Beyond Speaker Identity: Text Guided Target Speech Extraction
by: Huo, Mingyue, et al.
Published: (2025)

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
by: Ji, Shengpeng, et al.
Published: (2024)

Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes
by: Glazer, Neta, et al.
Published: (2025)

Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition
by: Wang, Huimeng, et al.
Published: (2025)

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)

SF-Speech: Straightened Flow for Zero-Shot Voice Clone
by: Li, Xuyuan, et al.
Published: (2024)

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
by: Jiang, Yuepeng, et al.
Published: (2024)

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
by: Wang, Hui, et al.
Published: (2025)

MiMo-Audio: Audio Language Models are Few-Shot Learners
by: Core Team, et al.
Published: (2025)

CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models
by: Chen, Junyang, et al.
Published: (2026)

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
by: Zhang, Hanlin, et al.
Published: (2026)

Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
by: Dixit, Satvik, et al.
Published: (2024)

Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis
by: Hu, Yifan, et al.
Published: (2025)

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios
by: Bukhari, Hazim, et al.
Published: (2024)

Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation
by: Yang, Mu, et al.
Published: (2024)

A Universal Identity Backdoor Attack against Speaker Verification based on Siamese Network
by: Zhao, Haodong, et al.
Published: (2023)

Speech Intelligibility Assessment with Uncertainty-Aware Whisper Embeddings and sLSTM
by: Zezario, Ryandhimas E., et al.
Published: (2025)

Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference
by: Dai, Shuqi, et al.
Published: (2025)

Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
by: Liao, Shijia, et al.
Published: (2024)

Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios
by: Subramanian, Aswin Shanmugam, et al.
Published: (2025)

Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement
by: Han, Seungu, et al.
Published: (2025)

Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
by: Nespoli, Francesco, et al.
Published: (2024)

SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
by: Wei, Linye, et al.
Published: (2025)

Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction
by: Dai, Wang, et al.
Published: (2025)

DCIM-AVSR: Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module
by: Wang, Xinyu, et al.
Published: (2024)

A Few-Shot Approach to Dysarthric Speech Intelligibility Level Classification Using Transformers
by: Chowdary, Paleti Nikhil, et al.
Published: (2023)

Few-Shot Bioacoustic Event Detection with Frame-Level Embedding Learning System
by: Zhao, PengYuan, et al.
Published: (2024)

GALD-SE: Guided Anisotropic Lightweight Diffusion for Efficient Speech Enhancement
by: Wang, Chengzhong, et al.
Published: (2024)

Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor
by: Si, Yongjie, et al.
Published: (2024)

Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model
by: Lehečka, Jan, et al.
Published: (2024)

Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
by: Zhang, Leying, et al.
Published: (2025)

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
by: Zhang, Bowen, et al.
Published: (2025)