| Main Authors: | Zhang, Binbin, Liang, Chengdong, Wang, Shuai, Geng, Xuelong, Guo, Zhao, Li, Haoyu, Yin, Hao, Yang, Xipeng, Zhang, Pengshen, Ma, Changwei, Xie, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.19902 |
Similar Items
TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch
by: Song, Xingchen, et al.
Published: (2024)
dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition
by: Tian, Wenjie, et al.
Published: (2026)
OSUM-Pangu: An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs
by: Liao, Yujie, et al.
Published: (2026)
Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
by: Xue, Hongfei, et al.
Published: (2025)
WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
by: Ma, Linhan, et al.
Published: (2024)
Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM
by: Shi, Jiatong, et al.
Published: (2025)
Enhancing Intelligibility for Generative Target Speech Extraction via Joint Optimization with Target Speaker ASR
by: Ma, Hao, et al.
Published: (2025)
Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text
by: Xue, Hongfei, et al.
Published: (2024)
SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
by: Zhang, Dong, et al.
Published: (2024)
Enhancing Non-Core Language Instruction-Following in Speech LLMs via Semi-Implicit Cross-Lingual CoT Reasoning
by: Xue, Hongfei, et al.
Published: (2025)
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)
SpeechAlign: Aligning Speech Generation to Human Preferences
by: Zhang, Dong, et al.
Published: (2024)
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
by: Bai, Ye, et al.
Published: (2024)
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
by: Huang, Ailin, et al.
Published: (2025)
Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation
by: Li, Hanzhao, et al.
Published: (2024)
Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition
by: Ma, Ziyang, et al.
Published: (2023)
Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
by: Zhao, Zhixian, et al.
Published: (2025)
A Survey on Speech Large Language Models for Understanding
by: Peng, Jing, et al.
Published: (2024)
FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech
by: Ma, Linhan, et al.
Published: (2025)
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit
by: Zhang, Xueyao, et al.
Published: (2023)
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
by: Ma, Ziyang, et al.
Published: (2024)
WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing
by: Dai, Yuhang, et al.
Published: (2025)
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
by: Geng, Xuelong, et al.
Published: (2025)
Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
by: Wu, Linzhi, et al.
Published: (2026)
ESPnet-SpeechLM: An Open Speech Language Model Toolkit
by: Tian, Jinchuan, et al.
Published: (2025)
X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System
by: Liu, Zhanxun, et al.
Published: (2025)
Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
by: Xie, Yuan, et al.
Published: (2026)
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)
AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025)
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning
by: Zhang, Shaolei, et al.
Published: (2024)
The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge
by: Guo, Yiwei, et al.
Published: (2024)
WeDefense: A Toolkit to Defend Against Fake Audio
by: Zhang, Lin, et al.
Published: (2026)
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation
by: Ma, Zhengrui, et al.
Published: (2024)
A Lightweight Fourier-based Network for Binaural Speech Enhancement with Spatial Cue Preservation
by: Lu, Xikun, et al.
Published: (2025)
DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
by: Xie, Hanke, et al.
Published: (2025)
FleSpeech: Flexibly Controllable Speech Generation with Various Prompts
by: Li, Hanzhao, et al.
Published: (2025)
Deep Learning Based Stage-wise Two-dimensional Speaker Localization with Large Ad-hoc Microphone Arrays
by: Liu, Shupei, et al.
Published: (2022)
CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction
by: Lu, Yudong, et al.
Published: (2025)