:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Rehman, Abdul, Cai, Jingyao, Zhang, Jian-Jun, Yang, Xiaosong
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing Sound
Online Access:	https://arxiv.org/abs/2509.23147
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
by: Zhang, Jiawei, et al.
Published: (2024)

AmbER$^2$: Dual Ambiguity-Aware Emotion Recognition Applied to Speech and Text
by: Wu, Jingyao, et al.
Published: (2026)

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
by: Du, Zhihao, et al.
Published: (2024)

DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
by: Wang, Jianzong, et al.
Published: (2023)

Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios
by: Westhausen, Nils L., et al.
Published: (2024)

LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech
by: Mu, Bingshen, et al.
Published: (2026)

WhisperFlow: speech foundation models in real time
by: Wang, Rongxiang, et al.
Published: (2024)

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
by: Gong, Jingyao
Published: (2026)

Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition
by: Zhang, Yiru, et al.
Published: (2025)

Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text
by: Xue, Hongfei, et al.
Published: (2024)

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
by: Chen, Huakang, et al.
Published: (2026)

Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
by: Liao, Shijia, et al.
Published: (2024)

Towards noise-robust speech inversion through multi-task learning with speech enhancement
by: Tabatabaee, Saba, et al.
Published: (2026)

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis
by: Chen, Zhiyong, et al.
Published: (2024)

FreeCodec: A disentangled neural speech codec with fewer tokens
by: Zheng, Youqiang, et al.
Published: (2024)

On the relationship between speech and hearing
by: Umesh, Srinivasan, et al.
Published: (2024)

Robust fine-tuning of speech recognition models via model merging: application to disordered speech
by: Ducorroy, Alexandre, et al.
Published: (2025)

DualSep: A Light-weight dual-encoder convolutional recurrent network for real-time in-car speech separation
by: Wang, Ziqian, et al.
Published: (2024)

On the effectiveness of enrollment speech augmentation for Target Speaker Extraction
by: Li, Junjie, et al.
Published: (2024)

An adaptive filter bank based neural network approach for time delay estimation and speech enhancement
by: Ma, Lu
Published: (2025)

An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech
by: Deng, Qingkun, et al.
Published: (2024)

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
by: Xue, Hongfei, et al.
Published: (2025)

A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation
by: Raghu, Ananya, et al.
Published: (2025)

Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder
by: Li, Xuyuan, et al.
Published: (2023)

SPGM: Prioritizing Local Features for enhanced speech separation performance
by: Yip, Jia Qi, et al.
Published: (2023)

Adversarial speech for voice privacy protection from Personalized Speech generation
by: Chen, Shihao, et al.
Published: (2024)

Weighted-Sampling Audio Adversarial Example Attack
by: Liu, Xiaolei, et al.
Published: (2019)

Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
by: Zhang, Xueyao, et al.
Published: (2025)

Automated evaluation of children's speech fluency for low-resource languages
by: Zhang, Bowen, et al.
Published: (2025)

Distilling a speech and music encoder with task arithmetic
by: Ritter-Gutierrez, Fabian, et al.
Published: (2025)

Probing mental health information in speech foundation models
by: de Gennes, Marc, et al.
Published: (2024)

Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition
by: An, Keyu, et al.
Published: (2024)

Selective Classifier-free Guidance for Zero-shot Text-to-speech
by: Zheng, John, et al.
Published: (2025)

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
by: Wang, Chunhui, et al.
Published: (2024)

Multi-speaker Text-to-speech Training with Speaker Anonymized Data
by: Huang, Wen-Chin, et al.
Published: (2024)

Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
by: Choi, Jeongsoo, et al.
Published: (2025)

Omni-directional attention mechanism based on Mamba for speech separation
by: Xue, Ke, et al.
Published: (2026)

Unsupervised speech enhancement with spectral kurtosis and double deep priors
by: Ohnaka, Hien, et al.
Published: (2024)

Inter-channel Conv-TasNet for multichannel speech enhancement
by: Lee, Dongheon, et al.
Published: (2021)

Improving child speech recognition with augmented child-like speech
by: Zhang, Yuanyuan, et al.
Published: (2024)