| Main Authors: | Wang, Guansu, Sun, Peijie |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.17555 |
Similar Items
Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis
by: Do, Cong-Thanh, et al.
Published: (2024)
Speech Recognition Rescoring with Large Speech-Text Foundation Models
by: Shivakumar, Prashanth Gurunath, et al.
Published: (2024)
Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
by: Cornell, Samuele, et al.
Published: (2024)
Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation
by: Huang, Ruizhe, et al.
Published: (2024)
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
by: Wang, Xi, et al.
Published: (2026)
Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text
by: Li, Jinpeng, et al.
Published: (2024)
Continuous Speech Tokenizer in Text To Speech
by: Li, Yixing, et al.
Published: (2024)
Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
by: Chien, Chung-Ming, et al.
Published: (2024)
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning
by: Peng, Yifan, et al.
Published: (2024)
Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis
by: Zhou, Kun, et al.
Published: (2024)
Streaming Speech-to-Confusion Network Speech Recognition
by: Filimonov, Denis, et al.
Published: (2023)
Improving the Inclusivity of Dutch Speech Recognition by Fine-tuning Whisper on the JASMIN-CGN Corpus
by: Shekoufandeh, Golshid, et al.
Published: (2025)
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2025)
Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model
by: Wu, Haibin, et al.
Published: (2025)
DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
by: Lu, Ke-Han, et al.
Published: (2024)
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection
by: Lin, Hsi-Che, et al.
Published: (2024)
UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models
by: Tu, Wenming, et al.
Published: (2025)
FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
by: Wang, Hui, et al.
Published: (2025)
MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems
by: Meng, Qingliang, et al.
Published: (2025)
Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition
by: Wang, Yujin, et al.
Published: (2022)
Speaker-Aware Simulation Improves Conversational Speech Recognition
by: Gedeon, Máté, et al.
Published: (2026)
MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis
by: Singh, Jaskaran, et al.
Published: (2025)
Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC
by: Wang, Qingzheng, et al.
Published: (2025)
An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis
by: Peng, Yifan, et al.
Published: (2024)
Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
by: Chen, Chen, et al.
Published: (2024)
Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems
by: Allbert, Rumi, et al.
Published: (2025)
Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning
by: Fang, Yangui, et al.
Published: (2025)
Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition
by: Tseng, Yuan, et al.
Published: (2025)
SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation
by: Du, Jiayu, et al.
Published: (2024)
Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
by: Futami, Hayato, et al.
Published: (2025)
Chain of Correction for Full-text Speech Recognition with Large Language Models
by: Tang, Zhiyuan, et al.
Published: (2025)
TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding
by: Huo, Mingyue, et al.
Published: (2026)
On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition
by: Rossenbach, Nick, et al.
Published: (2024)
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
by: Hu, Yuchen, et al.
Published: (2024)
Multi-stage Large Language Model Correction for Speech Recognition
by: Pu, Jie, et al.
Published: (2023)
Revisiting Interpolation Augmentation for Speech-to-Text Generation
by: Xu, Chen, et al.
Published: (2024)
Full-text Error Correction for Chinese Speech Recognition with Large Language Model
by: Tang, Zhiyuan, et al.
Published: (2024)
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)
Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations
by: Meghanani, Amit, et al.
Published: (2026)
Towards Unsupervised Speech Recognition Without Pronunciation Models
by: Ni, Junrui, et al.
Published: (2024)