Staff View: Mark My Works Autograder for Programming Courses :: Library Catalog

Bibliographic Details
Main Authors:	Qiu, Yiding; Azimi, Seyed Mahdi; Lensky, Artem
Format:	Preprint
Published:	2026
Subjects:	Software Engineering
Online Access:	https://arxiv.org/abs/2601.10093

_version_	1866915731649593344
author	Qiu, Yiding; Azimi, Seyed Mahdi; Lensky, Artem
author_facet	Qiu, Yiding; Azimi, Seyed Mahdi; Lensky, Artem
contents	Large programming courses struggle to provide timely, detailed feedback on student code. We developed Mark My Works, a local autograding system that combines traditional unit testing with LLM-generated explanations. The system uses role-based prompts to analyze submissions, critique code quality, and generate pedagogical feedback while maintaining transparency in its reasoning process. We piloted the system in a 191-student engineering course, comparing AI-generated assessments with human grading on 79 submissions. While AI scores showed no linear correlation with human scores (r = -0.177, p = 0.124), both systems exhibited similar left-skewed distributions, suggesting they recognize comparable quality hierarchies despite different scoring philosophies. The AI system demonstrated more conservative scoring (mean: 59.95 vs 80.53 human) but generated significantly more detailed technical feedback.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_10093
institution	arXiv
publishDate	2026
record_format	arxiv
title	Mark My Works Autograder for Programming Courses
topic	Software Engineering
url	https://arxiv.org/abs/2601.10093

Similar Items