:: Library Catalog

Bibliographic Details
Main Authors:	Shao, Mingchen; Su, Hang; Tian, Wenjie; Mu, Bingshen; Lin, Zhennan; Fan, Lichun; Luo, Zhenbo; Luan, Jian; Xie, Lei
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2604.22245

Similar Items

Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition
by: Zhang, Yiru, et al.
Published: (2025)

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
by: Shao, Mingchen, et al.
Published: (2025)

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)

LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech
by: Mu, Bingshen, et al.
Published: (2026)

dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition
by: Tian, Wenjie, et al.
Published: (2026)

Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
by: Wang, Zheng, et al.
Published: (2026)

Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR
by: Shao, Mingchen, et al.
Published: (2025)

HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models
by: Mu, Bingshen, et al.
Published: (2024)

Efficient Scaling for LLM-based ASR
by: Mu, Bingshen, et al.
Published: (2025)

Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model
by: Xia, Kangxiang, et al.
Published: (2026)

Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition
by: Mu, Bingshen, et al.
Published: (2025)

Lightweight speech enhancement guided target speech extraction in noisy multi-speaker scenarios
by: Huang, Ziling, et al.
Published: (2025)

Summary on The Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods
by: Mu, Bingshen, et al.
Published: (2025)

MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition
by: Mu, Bingshen, et al.
Published: (2024)

Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework
by: Yang, Hsiang-Cheng, et al.
Published: (2026)

Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
by: Zhao, Jinghua, et al.
Published: (2025)

E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models
by: Xue, Hongfei, et al.
Published: (2023)

CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions
by: Zhu, Xinfa, et al.
Published: (2025)

WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem
by: Wang, Chengyou, et al.
Published: (2026)

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
by: Geng, Xuelong, et al.
Published: (2024)

EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs
by: Tian, Wenjie, et al.
Published: (2026)

SALM: Spatial Audio Language Model with Structured Embeddings for Understanding and Editing
by: Hu, Jinbo, et al.
Published: (2025)

MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios
by: Wang, Shuai, et al.
Published: (2025)

MiDashengLM: Efficient Audio Understanding with General Audio Captions
by: Dinkel, Heinrich, et al.
Published: (2025)

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
by: Geng, Xuelong, et al.
Published: (2025)

Listen, Think, and Understand
by: Gong, Yuan, et al.
Published: (2023)

Borderless Long Speech Synthesis
by: Song, Xingchen, et al.
Published: (2026)

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
by: Li, Longhao, et al.
Published: (2026)

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
by: Li, Guojian, et al.
Published: (2026)

Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
by: Chaichana, Yuatyong, et al.
Published: (2025)

DIFFA: Large Language Diffusion Models Can Listen and Understand
by: Zhou, Jiaming, et al.
Published: (2025)

Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models
by: He, Haolin, et al.
Published: (2025)

Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling
by: Korbar, Bruno, et al.
Published: (2024)

Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition
by: Yang, Mu, et al.
Published: (2025)

Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis
by: Xu, Tianyi, et al.
Published: (2025)

DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
by: Tian, Wenjie, et al.
Published: (2025)

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT
by: Du, Zhihao, et al.
Published: (2023)

Text-aware and Context-aware Expressive Audiobook Speech Synthesis
by: Guo, Dake, et al.
Published: (2024)

Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis
by: Tian, Wenjie, et al.
Published: (2025)

Listenable Maps for Audio Classifiers
by: Paissan, Francesco, et al.
Published: (2024)