MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Ke; He, Lei; Liu, Kun; Deng, Yan; Wei, Wenning; Zhao, Sheng
Format:	Preprint
Published:	2025
Subjects:	Sound; Computation and Language; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2503.11229

_version_	1866929759824379904
author	Wang, Ke; He, Lei; Liu, Kun; Deng, Yan; Wei, Wenning; Zhao, Sheng
author_facet	Wang, Ke; He, Lei; Liu, Kun; Deng, Yan; Wei, Wenning; Zhao, Sheng
contents	Large Multimodal Models (LMMs) have demonstrated exceptional performance across a wide range of domains. This paper explores their potential in pronunciation assessment tasks, with a particular focus on evaluating the capabilities of the Generative Pre-trained Transformer (GPT) model, specifically GPT-4o. Our study investigates its ability to process speech and audio for pronunciation assessment across multiple levels of granularity and dimensions, with an emphasis on feedback generation and scoring. For our experiments, we use the publicly available Speechocean762 dataset. The evaluation focuses on two key aspects: multi-level scoring and the practicality of the generated feedback. Scoring results are compared against the manual scores provided in the Speechocean762 dataset, while feedback quality is assessed using Large Language Models (LLMs). The findings highlight the effectiveness of integrating LMMs with traditional methods for pronunciation assessment, offering insights into the model's strengths and identifying areas for further improvement.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_11229
institution	arXiv
publishDate	2025
record_format	arxiv
title	Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment
topic	Sound; Computation and Language; Audio and Speech Processing
url	https://arxiv.org/abs/2503.11229

Similar Items