Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Wilkens, Rodrigo; Cardon, Rémi; Folny, Vincent; François, Thomas
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2606.02009

_version_	1866914622816124928
author	Wilkens, Rodrigo; Cardon, Rémi; Folny, Vincent; François, Thomas
author_facet	Wilkens, Rodrigo; Cardon, Rémi; Folny, Vincent; François, Thomas
contents	In Automated Essay Scoring (AES), benchmarking practices have fostered minimalist evaluation practices, in contrast with the broader-view recommendations of evaluation frameworks, such as the argument-based validation framework (ABV), which argued in favor of a multidimensional assessment of systems, especially in the context of high-stakes language tests. In this paper, we introduce an enhanced and more practical version of the ABV framework, incorporating fairness analysis, correlations with linguistic features, prediction error evaluation, and model agreement compared with human raters. Applying this framework to French AES, we compare 8 model architectures on a corpus of 27k exam essays (2 raters each) and a generalization corpus of 961 essays (at least nine raters each). Our analyses illustrate the benefits of applying the ABV framework to better understand the capabilities and pitfalls of AES models, while also advancing the state-of-the-art for French AES.
format	Preprint
id	arxiv_https___arxiv_org_abs_2606_02009
institution	arXiv
publishDate	2026
record_format	arxiv
title	Automated Essay Scoring and Language Certification: Assessing Generalizability, Agreement and Validity for French
topic	Computation and Language
url	https://arxiv.org/abs/2606.02009
