Bibliographic Details
Main Authors: Qiu, Yiding, Azimi, Seyed Mahdi, Lensky, Artem
Format: Preprint
Published: 2026
Subjects: Software Engineering
Online Access: https://arxiv.org/abs/2601.10093
author Qiu, Yiding
Azimi, Seyed Mahdi
Lensky, Artem
contents Large programming courses struggle to provide timely, detailed feedback on student code. We developed Mark My Works, a local autograding system that combines traditional unit testing with LLM-generated explanations. The system uses role-based prompts to analyze submissions, critique code quality, and generate pedagogical feedback while maintaining transparency in its reasoning process. We piloted the system in a 191-student engineering course, comparing AI-generated assessments with human grading on 79 submissions. While AI scores showed no linear correlation with human scores (r = -0.177, p = 0.124), both systems exhibited similar left-skewed distributions, suggesting they recognize comparable quality hierarchies despite different scoring philosophies. The AI system demonstrated more conservative scoring (mean: 59.95 vs 80.53 human) but generated significantly more detailed technical feedback.
format Preprint
id arxiv_https___arxiv_org_abs_2601_10093
institution arXiv
publishDate 2026
record_format arxiv
title Mark My Works Autograder for Programming Courses
topic Software Engineering