Staff View: RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows

Bibliographic Details
Main Authors:	Mahdavi, Hamed; Mahdavinia, Pouria; Malek, Samira; Mohammadipour, Pegah; Hashemi, Alireza; Daliri, Majid; Farhadi, Alireza; Khasahmadi, Amir; Mireshghallah, Niloofar; Honavar, Vasant
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2510.09021

_version_	1866912640461176832
author	Mahdavi, Hamed; Mahdavinia, Pouria; Malek, Samira; Mohammadipour, Pegah; Hashemi, Alireza; Daliri, Majid; Farhadi, Alireza; Khasahmadi, Amir; Mireshghallah, Niloofar; Honavar, Vasant
author_facet	Mahdavi, Hamed; Mahdavinia, Pouria; Malek, Samira; Mohammadipour, Pegah; Hashemi, Alireza; Daliri, Majid; Farhadi, Alireza; Khasahmadi, Amir; Mireshghallah, Niloofar; Honavar, Vasant
contents	State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_09021
institution	arXiv
publishDate	2025
record_format	arxiv
title	RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
topic	Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2510.09021

Similar Items