Staff View :: Library Catalog


Bibliographic Details
Main Authors:	Luhtaru, Agnes; Vainikko, Martin; Liin, Krista; Allkivi-Metsoja, Kais; Kippar, Jaagup; Eslon, Pille; Fishel, Mark
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2402.11671

_version_	1866916130774319104
author	Luhtaru, Agnes; Vainikko, Martin; Liin, Krista; Allkivi-Metsoja, Kais; Kippar, Jaagup; Eslon, Pille; Fishel, Mark
author_facet	Luhtaru, Agnes; Vainikko, Martin; Liin, Krista; Allkivi-Metsoja, Kais; Kippar, Jaagup; Eslon, Pille; Fishel, Mark
contents	The project was funded in 2021-2023 by the National Programme of Estonian Language Technology. Its main aim was to develop spelling and grammar correction tools for the Estonian language. The main challenge was the very small amount of error correction data available for such development. To mitigate this, (1) we annotated more correction data for model training and testing, (2) we tested transfer learning, i.e. retraining machine learning models created for other tasks, so as not to depend solely on correction data, and (3) we compared the developed method and model with alternatives, including large language models. We also developed automatic evaluation, which can calculate the accuracy and yield of corrections by error category, so that the effectiveness of different methods can be compared in detail. Large language models saw a breakthrough during the project: GPT4, a commercial language model with Estonian-language support, was released. We took the existence of the model into account when adjusting plans, and in the report we present a comparison with the ability of GPT4 to improve Estonian-language text. The final results show that the approach we have developed provides better scores than GPT4 and that the result is usable but not yet entirely reliable. The report also contains ideas on how GPT4 and other major language models can be implemented in the future, focusing on open-source solutions. All results of this project are open-data/open-source, with licenses that allow them to be used for purposes including commercial ones.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_11671
institution	arXiv
publishDate	2024
record_format	arxiv
title	Autocorrect for Estonian texts: final report from project EKTB25
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2402.11671
