| Main Authors: | Song, Jiaxin, Wang, Xinyu, Wang, Yihao, Tang, Yifan, Zhang, Ru, Liu, Jianyi, Liu, Gongshen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.00907 |
Similar Items
Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models
by: Zezario, Ryandhimas E., et al.
Published: (2026)
The Universal Personalizer: Few-Shot Dysarthric Speech Recognition via Meta-Learning
by: Agarwal, Dhruuv, et al.
Published: (2025)
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)
SaD: A Scenario-Aware Discriminator for Speech Enhancement
by: Yuan, Xihao, et al.
Published: (2025)
VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing
by: Anastassiou, Philip, et al.
Published: (2024)
Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models
by: Li, Zhaoqing, et al.
Published: (2025)
Leveraging Multimodal Methods and Spontaneous Speech for Alzheimer's Disease Identification
by: Gao, Yifan, et al.
Published: (2024)
Beyond Speaker Identity: Text Guided Target Speech Extraction
by: Huo, Mingyue, et al.
Published: (2025)
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
by: Ji, Shengpeng, et al.
Published: (2024)
Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes
by: Glazer, Neta, et al.
Published: (2025)
Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition
by: Wang, Huimeng, et al.
Published: (2025)
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
by: Wang, Tianrui, et al.
Published: (2025)
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
by: Jiang, Ziyue, et al.
Published: (2023)
SF-Speech: Straightened Flow for Zero-Shot Voice Clone
by: Li, Xuyuan, et al.
Published: (2024)
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
by: Jiang, Yuepeng, et al.
Published: (2024)
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
by: Wang, Hui, et al.
Published: (2025)
MiMo-Audio: Audio Language Models are Few-Shot Learners
by: Core Team, et al.
Published: (2025)
CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models
by: Chen, Junyang, et al.
Published: (2026)
SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
by: Zhang, Hanlin, et al.
Published: (2026)
Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
by: Dixit, Satvik, et al.
Published: (2024)
Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis
by: Hu, Yifan, et al.
Published: (2025)
SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios
by: Bukhari, Hazim, et al.
Published: (2024)
Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation
by: Yang, Mu, et al.
Published: (2024)
A Universal Identity Backdoor Attack against Speaker Verification based on Siamese Network
by: Zhao, Haodong, et al.
Published: (2023)
Speech Intelligibility Assessment with Uncertainty-Aware Whisper Embeddings and sLSTM
by: Zezario, Ryandhimas E., et al.
Published: (2025)
Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference
by: Dai, Shuqi, et al.
Published: (2025)
Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
by: Liao, Shijia, et al.
Published: (2024)
Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios
by: Subramanian, Aswin Shanmugam, et al.
Published: (2025)
Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement
by: Han, Seungu, et al.
Published: (2025)
Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
by: Nespoli, Francesco, et al.
Published: (2024)
SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
by: Wei, Linye, et al.
Published: (2025)
Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction
by: Dai, Wang, et al.
Published: (2025)
DCIM-AVSR : Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module
by: Wang, Xinyu, et al.
Published: (2024)
A Few-Shot Approach to Dysarthric Speech Intelligibility Level Classification Using Transformers
by: Chowdary, Paleti Nikhil, et al.
Published: (2023)
Few-Shot Bioacoustic Event Detection with Frame-Level Embedding Learning System
by: Zhao, PengYuan, et al.
Published: (2024)
GALD-SE: Guided Anisotropic Lightweight Diffusion for Efficient Speech Enhancement
by: Wang, Chengzhong, et al.
Published: (2024)
Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor
by: Si, Yongjie, et al.
Published: (2024)
Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model
by: Lehečka, Jan, et al.
Published: (2024)
Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
by: Zhang, Leying, et al.
Published: (2025)
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
by: Zhang, Bowen, et al.
Published: (2025)