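Staff View: Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment :: Library Catalog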

Bibliographic Details
Main Authors:	Lee, Grandee; Wang, Yue; Lye, Che Yee; Peh, Luke
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.19529

_version_	1866913146110738432
author	Lee, Grandee; Wang, Yue; Lye, Che Yee; Peh, Luke
author_facet	Lee, Grandee; Wang, Yue; Lye, Che Yee; Peh, Luke
contents	When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_19529
institution	arXiv
publishDate	2026
record_format	arxiv
title	Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.19529
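
Note on the reported statistic: the abstract pairs r = 0.698 with "roughly half the intended variance", which matches the squared correlation (0.698^2 is about 0.487). The sketch below illustrates that arithmetic and one plausible way to summarize agreement between intended and recovered skill levels. It assumes r is a Pearson correlation, which the record does not state; the function names and the sample data are hypothetical and are not taken from the preprint.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def gea_summary(intended, recovered):
    """Hypothetical GEA summary: correlation, shared variance, mean bias.

    intended  -- skill levels the generator was instructed to produce
    recovered -- skill levels the scorer assigned to the generated responses
    """
    r = pearson_r(intended, recovered)
    # Positive mean bias means the scorer overestimates on average.
    bias = mean(rec - tgt for tgt, rec in zip(intended, recovered))
    return {"r": r, "r_squared": r * r, "mean_bias": bias}


# Shared variance implied by the correlation reported in the abstract.
reported_r = 0.698
reported_r_squared = reported_r ** 2

# Made-up illustration data (assumption): low skill levels are overestimated.
intended = [1, 2, 3, 4, 5]
recovered = [2.5, 2.0, 3.5, 4.0, 4.5]
summary = gea_summary(intended, recovered)
```

With the made-up data, the summary shows a positive mean bias driven mostly by the lowest intended level, which is the pattern the abstract describes as low-skill overestimation.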

Similar Items