| Main Authors: | Liu, Chang; Hu, Ya-Jun; Gao, Ying-Ying; Zhang, Shi-Lei; Ling, Zhen-Hua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2509.18798 |
Similar Items
Adapting Speech Foundation Models for Unified Multimodal Speech Recognition with Large Language Models
by: Zhang, Jing-Xuan, et al.
Published: (2025)
Group Relative Policy Optimization for Speech Recognition
by: Shivakumar, Prashanth Gurunath, et al.
Published: (2025)
B-GRPO: Unsupervised Speech Emotion Recognition based on Batched-Group Relative Policy Optimization
by: Gao, Yingying, et al.
Published: (2026)
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
by: Sun, Xiaohui, et al.
Published: (2025)
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models
by: Zhang, Jing-Xuan, et al.
Published: (2025)
Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2024)
Universal Preference-Score-based Pairwise Speech Quality Assessment
by: Shi, Yu-Fei, et al.
Published: (2025)
Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation
by: Gu, Yi, et al.
Published: (2026)
MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra
by: Lu, Ye-Xin, et al.
Published: (2023)
Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks
by: Ai, Yang, et al.
Published: (2024)
Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)
Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
by: Shao, Mingchen, et al.
Published: (2025)
MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech
by: Xia, Kangxiang, et al.
Published: (2025)
Sparsity-Driven EEG Channel Selection for Brain-Assisted Speech Enhancement
by: Zhang, Jie, et al.
Published: (2023)
Clever Hans Effect Found in Automatic Detection of Alzheimer's Disease through Speech
by: Liu, Yin-Long, et al.
Published: (2024)
Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization
by: Wan, Genshun, et al.
Published: (2026)
DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
by: Lu, Ye-Xin, et al.
Published: (2025)
Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
by: Liao, Shijia, et al.
Published: (2024)
Rethinking Flow and Diffusion Bridge Models for Speech Enhancement
by: Wang, Dahan, et al.
Published: (2026)
Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
by: Xue, Jinlong, et al.
Published: (2024)
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)
Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement
by: Lu, Ye-Xin, et al.
Published: (2023)
High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model
by: Lee, Joun Yeop, et al.
Published: (2024)
A High-Quality and Low-Complexity Streamable Neural Speech Codec with Knowledge Distillation
by: Zhang, En-Wei, et al.
Published: (2025)
SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models
by: Wang, Linqin, et al.
Published: (2024)
Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model
by: Wang, Siyang, et al.
Published: (2024)
FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech
by: Ma, Linhan, et al.
Published: (2025)
Multi-Stage Speech Bandwidth Extension with Flexible Sampling Rate Control
by: Lu, Ye-Xin, et al.
Published: (2024)
SELM: Speech Enhancement Using Discrete Tokens and Language Models
by: Wang, Ziqian, et al.
Published: (2023)
Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer's Disease Detection
by: Liu, Yin-Long, et al.
Published: (2025)
A Survey on Speech Large Language Models for Understanding
by: Peng, Jing, et al.
Published: (2024)
Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
by: Cao, Junjie, et al.
Published: (2025)
Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction
by: Du, Hui-Peng, et al.
Published: (2026)
CFMDCTCodec: A Low-Bitrate Neural Speech Codec with Noise-Prior-aware Conditional Flow Matching for MDCT-Spectral Enhancement
by: Jiang, Xiao-Hang, et al.
Published: (2026)
Customizing Speech Recognition Model with Large Language Model Feedback
by: Ling, Shaoshi, et al.
Published: (2025)
Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
by: Eskimez, Sefik Emre, et al.
Published: (2024)
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning
by: Peng, Yifan, et al.
Published: (2024)
CodeSep: Low-Bitrate Codec-Driven Speech Separation with Base-Token Disentanglement and Auxiliary-Token Serial Prediction
by: Du, Hui-Peng, et al.
Published: (2026)
Text-aware and Context-aware Expressive Audiobook Speech Synthesis
by: Guo, Dake, et al.
Published: (2024)
GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling
by: Yao, Jixun, et al.
Published: (2025)