Bibliographic Details
Main authors: Imperial, Joseph Marvin, Barayan, Abdullah, Stodden, Regina, Wilkens, Rodrigo, Sanchez, Ricardo Munoz, Gao, Lingyun, Torgbi, Melissa, Knight, Dawn, Forey, Gail, Jablonkai, Reka R., Kochmar, Ekaterina, Reynolds, Robert, Ribeiro, Eugénio, Saggion, Horacio, Volodina, Elena, Vajjala, Sowmya, François, Thomas, Alva-Manchego, Fernando, Madabushi, Harish Tayyar
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online access: https://arxiv.org/abs/2506.01419
_version_ 1866911156123205632
author Imperial, Joseph Marvin
Barayan, Abdullah
Stodden, Regina
Wilkens, Rodrigo
Sanchez, Ricardo Munoz
Gao, Lingyun
Torgbi, Melissa
Knight, Dawn
Forey, Gail
Jablonkai, Reka R.
Kochmar, Ekaterina
Reynolds, Robert
Ribeiro, Eugénio
Saggion, Horacio
Volodina, Elena
Vajjala, Sowmya
François, Thomas
Alva-Manchego, Fernando
Madabushi, Harish Tayyar
contents We introduce UniversalCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels in 13 languages. To enable open research in automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modelling across tasks and languages. To demonstrate its utility, we conduct benchmarking experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support using linguistic features and fine-tuning pre-trained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
format Preprint
id arxiv_https___arxiv_org_abs_2506_01419
institution arXiv
publishDate 2025
record_format arxiv
title UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
topic Computation and Language
url https://arxiv.org/abs/2506.01419