Staff View: When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment :: Library Catalog

Bibliographic Details
Main Authors:	Ferrer, Robinson; Turgut, Damla; Chen, Zhongzhou; Sonkar, Shashank
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Computers and Society
Online Access:	https://arxiv.org/abs/2603.29559

_version_	1866915901552459776
author	Ferrer, Robinson; Turgut, Damla; Chen, Zhongzhou; Sonkar, Shashank
author_facet	Ferrer, Robinson; Turgut, Damla; Chen, Zhongzhou; Sonkar, Shashank
contents	Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains 38% worse despite requiring 5× the inference cost. Larger models exhibit substantially better calibration though gains vary by dataset and method (e.g., a 28% ECE reduction for self-reported), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly top-skewed across methods, creating a "confidence floor" that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available at https://github.com/sonkar-lab/llm_grading_calibration.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_29559
institution	arXiv
publishDate	2026
record_format	arxiv
title	When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment
topic	Computation and Language; Computers and Society
url	https://arxiv.org/abs/2603.29559

Similar Items