:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Binbin; Liang, Chengdong; Wang, Shuai; Geng, Xuelong; Guo, Zhao; Li, Haoyu; Yin, Hao; Yang, Xipeng; Zhang, Pengshen; Ma, Changwei; Xie, Lei
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2509.19902

Similar Items

TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch
by: Song, Xingchen, et al.
Published: (2024)

dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition
by: Tian, Wenjie, et al.
Published: (2026)

OSUM-Pangu: An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs
by: Liao, Yujie, et al.
Published: (2026)

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
by: Xue, Hongfei, et al.
Published: (2025)

WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
by: Ma, Linhan, et al.
Published: (2024)

Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM
by: Shi, Jiatong, et al.
Published: (2025)

Enhancing Intelligibility for Generative Target Speech Extraction via Joint Optimization with Target Speaker ASR
by: Ma, Hao, et al.
Published: (2025)

Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text
by: Xue, Hongfei, et al.
Published: (2024)

SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
by: Zhang, Dong, et al.
Published: (2024)

Enhancing Non-Core Language Instruction-Following in Speech LLMs via Semi-Implicit Cross-Lingual CoT Reasoning
by: Xue, Hongfei, et al.
Published: (2025)

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
by: Zhang, Xin, et al.
Published: (2023)

SpeechAlign: Aligning Speech Generation to Human Preferences
by: Zhang, Dong, et al.
Published: (2024)

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
by: Bai, Ye, et al.
Published: (2024)

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
by: Huang, Ailin, et al.
Published: (2025)

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation
by: Li, Hanzhao, et al.
Published: (2024)

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition
by: Ma, Ziyang, et al.
Published: (2023)

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
by: Zhao, Zhixian, et al.
Published: (2025)

A Survey on Speech Large Language Models for Understanding
by: Peng, Jing, et al.
Published: (2024)

FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech
by: Ma, Linhan, et al.
Published: (2025)

Amphion: An Open-Source Audio, Music and Speech Generation Toolkit
by: Zhang, Xueyao, et al.
Published: (2023)

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
by: Ma, Ziyang, et al.
Published: (2024)

WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing
by: Dai, Yuhang, et al.
Published: (2025)

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
by: Geng, Xuelong, et al.
Published: (2025)

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
by: Wu, Linzhi, et al.
Published: (2026)

ESPnet-SpeechLM: An Open Speech Language Model Toolkit
by: Tian, Jinchuan, et al.
Published: (2025)

X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System
by: Liu, Zhanxun, et al.
Published: (2025)

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
by: Cheng, Shihao, et al.
Published: (2026)

Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
by: Xie, Yuan, et al.
Published: (2026)

LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)

AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025)

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning
by: Zhang, Shaolei, et al.
Published: (2024)

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge
by: Guo, Yiwei, et al.
Published: (2024)

WeDefense: A Toolkit to Defend Against Fake Audio
by: Zhang, Lin, et al.
Published: (2026)

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation
by: Ma, Zhengrui, et al.
Published: (2024)

A Lightweight Fourier-based Network for Binaural Speech Enhancement with Spatial Cue Preservation
by: Lu, Xikun, et al.
Published: (2025)

DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
by: Xie, Hanke, et al.
Published: (2025)

FleSpeech: Flexibly Controllable Speech Generation with Various Prompts
by: Li, Hanzhao, et al.
Published: (2025)

Deep Learning Based Stage-wise Two-dimensional Speaker Localization with Large Ad-hoc Microphone Arrays
by: Liu, Shupei, et al.
Published: (2022)

CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction
by: Lu, Yudong, et al.
Published: (2025)