Staff View: Unlocking Large Audio-Language Models for Interactive Language Learning :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Hongfu; Cui, Zhouying; Gu, Xiangming; Wang, Ye
Format:	Preprint
Published:	2026
Subjects:	Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2601.14744

_version_	1866908779495292928
author	Liu, Hongfu; Cui, Zhouying; Gu, Xiangming; Wang, Ye
author_facet	Liu, Hongfu; Cui, Zhouying; Gu, Xiangming; Wang, Ye
contents	Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting its effectiveness. Recent advancements in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLMs and existing ALMs on this dataset, specifically in detecting mispronunciation and generating actionable feedback. To improve the performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation in terms of both objective and human evaluation, highlighting the value of the proposed dataset.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_14744
institution	arXiv
publishDate	2026
record_format	arxiv
title	Unlocking Large Audio-Language Models for Interactive Language Learning
topic	Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2601.14744

Similar Items