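Staff View: Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K-12 Science Instructional Materials :: Library Catalog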

Bibliographic Details
Main Authors:	He, Peng; Li, Zhaohui; Wang, Zeyuan; Xiong, Jinjun; Li, Tingting
Format:	Preprint
Published:	2026
Subjects:	Computers and Society; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.13243

_version_	1866911447521427456
author	He, Peng; Li, Zhaohui; Wang, Zeyuan; Xiong, Jinjun; Li, Tingting
author_facet	He, Peng; Li, Zhaohui; Wang, Zeyuan; Xiong, Jinjun; Li, Tingting
contents	Designing high-quality, standards-aligned instructional materials for K-12 science is time-consuming and expertise-intensive. This study examines what human experts notice when reviewing AI-generated evaluations of such materials, aiming to translate their insights into design principles for a future GenAI-based instructional material design agent. We intentionally selected 12 high-quality curriculum units across life, physical, and earth sciences from validated programs such as OpenSciEd and Multiple Literacies in Project-based Learning. Using the EQuIP rubric with 9 evaluation items, we prompted GPT-4o, Claude, and Gemini to produce numerical ratings and written rationales for each unit, generating 648 evaluation outputs. Two science education experts independently reviewed all outputs, marking agreement (1) or disagreement (0) for both scores and rationales, and offering qualitative reflections on AI reasoning. This process surfaces patterns in where LLM judgments align with or diverge from expert perspectives, revealing reasoning strengths, gaps, and contextual nuances. These insights will directly inform the development of a domain-specific GenAI agent to support the design of high-quality instructional materials in K-12 science education.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_13243
institution	arXiv
publishDate	2026
record_format	arxiv
title	Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K-12 Science Instructional Materials
topic	Computers and Society; Artificial Intelligence
url	https://arxiv.org/abs/2602.13243
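
Note on the counts in the abstract: the record does not spell out how the 648 evaluation outputs arise. One reading that matches the figure is 12 units x 9 rubric items x 3 models x 2 output kinds (a numerical score and a written rationale). The Python sketch below checks that arithmetic and shows how binary agreement marks (1 or 0) could be summarized. The breakdown is an assumption, every name in the code is hypothetical, and the sample marks are invented for illustration; none of it is data or code from the study.

```python
from itertools import product

# Figures taken from the abstract.
UNITS = 12                                 # curriculum units
RUBRIC_ITEMS = 9                           # EQuIP evaluation items
MODELS = ("GPT-4o", "Claude", "Gemini")    # models prompted

# Assumption: each (unit, item, model) triple yields two outputs.
OUTPUT_KINDS = ("score", "rationale")


def enumerate_outputs():
    """List every (unit, item, model, kind) combination under the assumed design."""
    return list(
        product(
            range(1, UNITS + 1),
            range(1, RUBRIC_ITEMS + 1),
            MODELS,
            OUTPUT_KINDS,
        )
    )


def agreement_rate(marks):
    """Fraction of outputs marked 1 (agree) among binary expert marks."""
    if not marks:
        raise ValueError("no marks to summarize")
    if any(mark not in (0, 1) for mark in marks):
        raise ValueError("marks must be 0 or 1")
    return sum(marks) / len(marks)


if __name__ == "__main__":
    print(len(enumerate_outputs()))        # 648 under the assumed breakdown
    # Invented marks, for illustration only.
    print(agreement_rate([1, 0, 1, 1]))    # 0.75
```

If the study counted outputs differently (for example, per expert rather than per output kind), only OUTPUT_KINDS would need to change in this sketch.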

Similar Items