Bibliographic Details
Main Authors: Chen, Zhongzhou, Wan, Tong
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2412.06910
Table of Contents:
  • This study examines the feasibility and potential advantages of using large language models, in particular GPT-4o, to perform partial credit grading of large numbers of students' written responses to introductory-level physics problems. Students were instructed to write down verbal explanations of their reasoning process when solving one conceptual and two numerical calculation problems on in-class exams. The explanations were then graded according to a 3-item rubric, with each item graded as binary (1 or 0). We first demonstrate that machine grading using GPT-4o, with neither examples nor a reference answer, can reliably agree with human graders in 70%-80% of all cases, which is equal to or higher than the level at which two human graders agree with each other. Two methods are essential for achieving this level of accuracy: 1. Adding explanatory language to each rubric item that targets the errors of initial machine grading. 2. Running the grading process 5 times and taking the most frequent outcome. Next, we show that the variation in outcomes across the 5 machine grading attempts, as measured by the Shannon entropy, can serve as a grading confidence index, allowing a human instructor to identify ~40% of all potentially incorrect gradings by reviewing just 10%-15% of all responses. Finally, we show that it is straightforward to use GPT-4o to write clear explanations of the partial credit grading outcomes. Those explanations can be used as feedback for students, allowing them to understand their grades and raise objections when necessary. Almost all feedback messages generated were rated 3 or above on a 5-point scale by two experienced instructors. The entire grading and feedback generation process costs roughly $5 per 100 student answers, which shows immense promise for automating labor-intensive grading through a combination of machine grading with human input and supervision.
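The majority vote and the entropy-based confidence index described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say whether the entropy is computed per rubric item or over the whole 3-item outcome, so the sketch assumes the latter. The function names (`majority_outcome`, `shannon_entropy`, `review_queue`), the response identifiers, and the sample data are all hypothetical.

```python
from collections import Counter
from math import log2


def majority_outcome(attempts):
    """Return the most frequent outcome among repeated grading attempts.

    Each attempt is a tuple of binary rubric scores, e.g. (1, 1, 0).
    Ties are broken in favor of the outcome seen first (an assumption;
    the abstract does not describe tie handling).
    """
    return Counter(attempts).most_common(1)[0][0]


def shannon_entropy(attempts):
    """Shannon entropy (in bits) of the outcome distribution.

    0.0 means all attempts agreed; larger values mean more disagreement.
    """
    n = len(attempts)
    return sum((c / n) * log2(n / c) for c in Counter(attempts).values())


def review_queue(entropy_by_response, fraction):
    """Return the response ids with the highest entropy, for human review.

    `fraction` is the share of responses a human is willing to check
    (hypothetical parameter; at least one response is always returned).
    """
    ranked = sorted(entropy_by_response, key=entropy_by_response.get, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return ranked[:k]


# Hypothetical data: 5 grading attempts per response, 3 binary rubric items each.
attempts_by_response = {
    "s1": [(1, 1, 0)] * 5,
    "s2": [(1, 1, 0), (1, 1, 0), (1, 0, 0), (1, 1, 0), (1, 1, 0)],
    "s3": [(0, 0, 0)] * 5,
    "s4": [(1, 1, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1), (1, 0, 1)],
}

final_grades = {rid: majority_outcome(a) for rid, a in attempts_by_response.items()}
entropies = {rid: shannon_entropy(a) for rid, a in attempts_by_response.items()}
to_review = review_queue(entropies, fraction=0.25)
```

Responses on which all five attempts agree have zero entropy and fall to the bottom of the queue. The responses with the most disagreement are the ones surfaced to the instructor.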