:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Zhang, Eric, Wei, Li, Chen, Sarah, Wang, Michael
Format:	Preprint
Published:	2025
Subjects:	Sound Artificial Intelligence Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2509.14304
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

Self-supervised Speech Models for Word-Level Stuttered Speech Detection
by: Shih, Yi-Jen, et al.
Published: (2024)

StutterCut: Uncertainty-Guided Normalised Cut for Dysfluency Segmentation
by: Ghosh, Suhita, et al.
Published: (2025)

Stutter-Solver: End-to-end Multi-lingual Dysfluency Detection
by: Zhou, Xuanru, et al.
Published: (2024)

A Two-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement
by: Lu, Shenghui, et al.
Published: (2025)

A Lightweight and Real-Time Binaural Speech Enhancement Model with Spatial Cues Preservation
by: Wang, Jingyuan, et al.
Published: (2024)

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models
by: Wang, Yuxiang, et al.
Published: (2026)

RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines
by: Attia, Ahmed Adel, et al.
Published: (2025)

Speech-based Clinical Depression Screening: An Empirical Study
by: Chen, Yangbin, et al.
Published: (2024)

Time-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation
by: Chen, Guo, et al.
Published: (2025)

Speech Recognition on TV Series with Video-guided Post-ASR Correction
by: Yang, Haoyuan, et al.
Published: (2025)

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing
by: Anastassiou, Philip, et al.
Published: (2024)

CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech
by: Wang, Helin, et al.
Published: (2025)

Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection
by: Zhang, Jinming, et al.
Published: (2025)

MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI
by: Shah, Neil, et al.
Published: (2024)

Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention
by: Wang, Cong, et al.
Published: (2025)

Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play
by: Shi, Jiatong, et al.
Published: (2025)

ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge
by: Wang, He, et al.
Published: (2024)

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
by: Wang, Xinsheng, et al.
Published: (2025)

Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge
by: Xue, Hongfei, et al.
Published: (2024)

The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023
by: Wang, He, et al.
Published: (2024)

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
by: Wang, Yongqi, et al.
Published: (2023)

SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
by: Wang, Helin, et al.
Published: (2025)

MEBM-Speech: Multi-scale Enhanced BrainMagic for Robust MEG Speech Detection
by: Songyi, Li, et al.
Published: (2026)

PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation
by: Xiao, Yujia, et al.
Published: (2025)

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec
by: Guo, Yiwei, et al.
Published: (2024)

Construction and Evaluation of Mandarin Multimodal Emotional Speech Database
by: Ting, Zhu, et al.
Published: (2024)

Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition
by: Schrüfer, Oliver, et al.
Published: (2024)

Defense Against Synthetic Speech: Real-Time Detection of RVC Voice Conversion Attacks
by: Chinchmalatpure, Prajwal, et al.
Published: (2025)

Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition
by: Li, Dongyuan, et al.
Published: (2024)

A Tutorial on Clinical Speech AI Development: From Data Collection to Model Validation
by: Ng, Si-Ioi, et al.
Published: (2024)

Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation
by: Nayeem, Md., et al.
Published: (2025)

Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate
by: Zhang, Hanglei, et al.
Published: (2025)

Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment
by: Azad, Asif, et al.
Published: (2026)

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
by: Zhang, Leying, et al.
Published: (2026)

Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition
by: Jiang, Yicong, et al.
Published: (2024)

Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People
by: Zhou, Haoshuai, et al.
Published: (2025)

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
by: Deng, Wei, et al.
Published: (2025)

Effective and Efficient Mixed Precision Quantization of Speech Foundation Models
by: Xu, Haoning, et al.
Published: (2025)

Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning
by: Medin, Lucas Block, et al.
Published: (2025)

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
by: Du, Zhihao, et al.
Published: (2025)