Bibliographic Details
Main Authors: Liu, Hongfu, Cui, Zhouying, Gu, Xiangming, Wang, Ye
Format: Preprint
Published: 2026
Subjects: Sound; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2601.14744
_version_ 1866908779495292928
author Liu, Hongfu
Cui, Zhouying
Gu, Xiangming
Wang, Ye
contents Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting its effectiveness. Recent advancements in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLMs and existing ALMs on this dataset, specifically in detecting mispronunciation and generating actionable feedback. To improve the performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation in terms of both objective and human evaluation, highlighting the value of the proposed dataset.
format Preprint
id arxiv_https___arxiv_org_abs_2601_14744
institution arXiv
publishDate 2026
record_format arxiv
title Unlocking Large Audio-Language Models for Interactive Language Learning
topic Sound
Audio and Speech Processing
url https://arxiv.org/abs/2601.14744