Bibliographic Details
Main Authors: He, Peng; Li, Zhaohui; Wang, Zeyuan; Xiong, Jinjun; Li, Tingting
Format: Preprint
Published: 2026
Subjects: Computers and Society; Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.13243
_version_ 1866911447521427456
author He, Peng
Li, Zhaohui
Wang, Zeyuan
Xiong, Jinjun
Li, Tingting
contents Designing high-quality, standards-aligned instructional materials for K–12 science is time-consuming and expertise-intensive. This study examines what human experts notice when reviewing AI-generated evaluations of such materials, aiming to translate their insights into design principles for a future GenAI-based instructional material design agent. We intentionally selected 12 high-quality curriculum units across life, physical, and earth sciences from validated programs such as OpenSciEd and Multiple Literacies in Project-based Learning. Using the EQuIP rubric with 9 evaluation items, we prompted GPT-4o, Claude, and Gemini to produce numerical ratings and written rationales for each unit, generating 648 evaluation outputs. Two science education experts independently reviewed all outputs, marking agreement (1) or disagreement (0) for both scores and rationales, and offering qualitative reflections on AI reasoning. This process surfaces patterns in where LLM judgments align with or diverge from expert perspectives, revealing reasoning strengths, gaps, and contextual nuances. These insights will directly inform the development of a domain-specific GenAI agent to support the design of high-quality instructional materials in K–12 science education.
format Preprint
id arxiv_https___arxiv_org_abs_2602_13243
institution arXiv
publishDate 2026
record_format arxiv
title Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K–12 Science Instructional Materials
topic Computers and Society
Artificial Intelligence
url https://arxiv.org/abs/2602.13243