| Main Authors: | Rehman, Abdul, Cai, Jingyao, Zhang, Jian-Jun, Yang, Xiaosong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.23147 |
Similar Items
I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
by: Zhang, Jiawei, et al.
Published: (2024)
AmbER$^2$: Dual Ambiguity-Aware Emotion Recognition Applied to Speech and Text
by: Wu, Jingyao, et al.
Published: (2026)
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
by: Du, Zhihao, et al.
Published: (2024)
DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
by: Wang, Jianzong, et al.
Published: (2023)
Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios
by: Westhausen, Nils L., et al.
Published: (2024)
LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech
by: Mu, Bingshen, et al.
Published: (2026)
WhisperFlow: speech foundation models in real time
by: Wang, Rongxiang, et al.
Published: (2024)
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
by: Gong, Jingyao
Published: (2026)
Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition
by: Zhang, Yiru, et al.
Published: (2025)
Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text
by: Xue, Hongfei, et al.
Published: (2024)
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
by: Chen, Huakang, et al.
Published: (2026)
Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
by: Liao, Shijia, et al.
Published: (2024)
Towards noise-robust speech inversion through multi-task learning with speech enhancement
by: Tabatabaee, Saba, et al.
Published: (2026)
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis
by: Chen, Zhiyong, et al.
Published: (2024)
FreeCodec: A disentangled neural speech codec with fewer tokens
by: Zheng, Youqiang, et al.
Published: (2024)
On the relationship between speech and hearing
by: Umesh, Srinivasan, et al.
Published: (2024)
Robust fine-tuning of speech recognition models via model merging: application to disordered speech
by: Ducorroy, Alexandre, et al.
Published: (2025)
DualSep: A Light-weight dual-encoder convolutional recurrent network for real-time in-car speech separation
by: Wang, Ziqian, et al.
Published: (2024)
On the effectiveness of enrollment speech augmentation for Target Speaker Extraction
by: Li, Junjie, et al.
Published: (2024)
An adaptive filter bank based neural network approach for time delay estimation and speech enhancement
by: Ma, Lu
Published: (2025)
An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech
by: Deng, Qingkun, et al.
Published: (2024)
Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
by: Xue, Hongfei, et al.
Published: (2025)
A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation
by: Raghu, Ananya, et al.
Published: (2025)
Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder
by: Li, Xuyuan, et al.
Published: (2023)
SPGM: Prioritizing Local Features for enhanced speech separation performance
by: Yip, Jia Qi, et al.
Published: (2023)
Adversarial speech for voice privacy protection from Personalized Speech generation
by: Chen, Shihao, et al.
Published: (2024)
Weighted-Sampling Audio Adversarial Example Attack
by: Liu, Xiaolei, et al.
Published: (2019)
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
by: Zhang, Xueyao, et al.
Published: (2025)
Automated evaluation of children's speech fluency for low-resource languages
by: Zhang, Bowen, et al.
Published: (2025)
Distilling a speech and music encoder with task arithmetic
by: Ritter-Gutierrez, Fabian, et al.
Published: (2025)
Probing mental health information in speech foundation models
by: de Gennes, Marc, et al.
Published: (2024)
Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition
by: An, Keyu, et al.
Published: (2024)
Selective Classifier-free Guidance for Zero-shot Text-to-speech
by: Zheng, John, et al.
Published: (2025)
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
by: Wang, Chunhui, et al.
Published: (2024)
Multi-speaker Text-to-speech Training with Speaker Anonymized Data
by: Huang, Wen-Chin, et al.
Published: (2024)
Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
by: Choi, Jeongsoo, et al.
Published: (2025)
Omni-directional attention mechanism based on Mamba for speech separation
by: Xue, Ke, et al.
Published: (2026)
Unsupervised speech enhancement with spectral kurtosis and double deep priors
by: Ohnaka, Hien, et al.
Published: (2024)
Inter-channel Conv-TasNet for multichannel speech enhancement
by: Lee, Dongheon, et al.
Published: (2021)
Improving child speech recognition with augmented child-like speech
by: Zhang, Yuanyuan, et al.
Published: (2024)