Bibliographic Details
Main Authors: Choi, Junhyuk, Park, Sohhyung, Cho, Chanhee, Park, Hyeonchu, Kim, Bugeun
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.00521
_version_ 1866914615166763008
author Choi, Junhyuk
Park, Sohhyung
Cho, Chanhee
Park, Hyeonchu
Kim, Bugeun
contents While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework and show that leveraging IRT-GRM yields interpretable signals for systematically diagnosing judgments. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.
format Preprint
id arxiv_https___arxiv_org_abs_2602_00521
institution arXiv
publishDate 2026
record_format arxiv
title Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory
topic Artificial Intelligence
url https://arxiv.org/abs/2602.00521