Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Gaggioli, Andrea; Casaburi, Giuseppe; Ercolani, Leonardo; Collova', Francesco; Torre, Pietro; Davide, Fabrizio
Format:	Preprint
Published:	2025
Subjects:	Computers and Society; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2508.02442

_version_	1866913974681862144
author	Gaggioli, Andrea; Casaburi, Giuseppe; Ercolani, Leonardo; Collova', Francesco; Torre, Pietro; Davide, Fabrizio
contents	This study investigates the reliability and validity of five advanced Large Language Models (LLMs), Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B, for automated essay scoring in a real-world higher education context. A total of 67 Italian-language student essays, written as part of a university psychology course, were evaluated using a four-criterion rubric (Pertinence, Coherence, Originality, Feasibility). Each model scored all essays across three prompt replications to assess intra-model stability. Human-LLM agreement was consistently low and non-significant (Quadratic Weighted Kappa), and within-model reliability across replications was similarly weak (median Kendall's W < 0.30). Systematic scoring divergences emerged, including a tendency to inflate Coherence and inconsistent handling of context-dependent dimensions. Inter-model agreement analysis revealed moderate convergence for Coherence and Originality, but negligible concordance for Pertinence and Feasibility. Although limited in scope, these findings suggest that current LLMs may struggle to replicate human judgment in tasks requiring disciplinary insight and contextual sensitivity. Human oversight remains critical when evaluating open-ended academic work, particularly in interpretive domains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_02442
institution	arXiv
publishDate	2025
record_format	arxiv
title	Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education
topic	Computers and Society; Artificial Intelligence
url	https://arxiv.org/abs/2508.02442
