:: Library Catalog

Bibliographic Details
Main Authors:	Mu, Bingshen; Shi, Xian; Wang, Xiong; Liu, Hexin; Xu, Jin; Xie, Lei
Format:	Preprint
Published:	2026
Subjects:	Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2601.18220

Similar Items

Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition
by: Mu, Bingshen, et al.
Published: (2025)

dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition
by: Tian, Wenjie, et al.
Published: (2026)

Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model
by: Xia, Kangxiang, et al.
Published: (2026)

Efficient Scaling for LLM-based ASR
by: Mu, Bingshen, et al.
Published: (2025)

Summary on The Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods
by: Mu, Bingshen, et al.
Published: (2025)

HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models
by: Mu, Bingshen, et al.
Published: (2024)

MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition
by: Mu, Bingshen, et al.
Published: (2024)

FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data
by: Liu, Dancheng, et al.
Published: (2024)

Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration
by: Lou, Haowei, et al.
Published: (2024)

Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR
by: Shao, Mingchen, et al.
Published: (2025)

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
by: Shao, Mingchen, et al.
Published: (2025)

Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text
by: Xue, Hongfei, et al.
Published: (2024)

PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification
by: Seth, Ashish, et al.
Published: (2024)

Enhancing Non-Core Language Instruction-Following in Speech LLMs via Semi-Implicit Cross-Lingual CoT Reasoning
by: Xue, Hongfei, et al.
Published: (2025)

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
by: Geng, Xuelong, et al.
Published: (2024)

S2ST-Omni: Hierarchical Language-Aware SpeechLLM Adaptation for Multilingual Speech-to-Speech Translation
by: Pan, Yu, et al.
Published: (2025)

E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models
by: Xue, Hongfei, et al.
Published: (2023)

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)

WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem
by: Wang, Chengyou, et al.
Published: (2026)

BFA: Real-time Multilingual Text-to-speech Forced Alignment
by: Rehman, Abdul, et al.
Published: (2025)

GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling
by: Yao, Jixun, et al.
Published: (2025)

Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers
by: Stooke, Adam, et al.
Published: (2025)

EASY: Emotion-aware Speaker Anonymization via Factorized Distillation
by: Yao, Jixun, et al.
Published: (2025)

DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
by: Xie, Hanke, et al.
Published: (2025)

Efficient Long-Form Speech Recognition for General Speech In-Context Learning
by: Yen, Hao, et al.
Published: (2024)

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
by: Chen, Huakang, et al.
Published: (2026)

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
by: Xue, Hongfei, et al.
Published: (2025)

Chunkwise Aligners for Streaming Speech Recognition
by: Teo, Wen Shen, et al.
Published: (2026)

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
by: Xing, Yazhou, et al.
Published: (2024)

LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
by: Yu, Fan, et al.
Published: (2024)

Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding
by: Ahn, Hoseong, et al.
Published: (2026)

FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration
by: Xu, Kai-Tuo, et al.
Published: (2025)

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
by: Bai, Ye, et al.
Published: (2024)

ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
by: Le, Khanh, et al.
Published: (2025)

Parallel Synthesis for Autoregressive Speech Generation
by: Hsu, Po-chun, et al.
Published: (2022)

Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining
by: Zhou, Rui, et al.
Published: (2024)

Long-Form Speech Generation with Spoken Language Models
by: Park, Se Jin, et al.
Published: (2024)

VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling
by: Zhou, Yixuan, et al.
Published: (2024)

KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction
by: Xia, Kangxiang, et al.
Published: (2024)

Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer
by: Honda, Tomoki, et al.
Published: (2024)