Staff View: Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory


Bibliographic Details
Main Authors:	Choi, Junhyuk; Park, Sohhyung; Cho, Chanhee; Park, Hyeonchu; Kim, Bugeun
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.00521

_version_	1866914615166763008
author	Choi, Junhyuk; Park, Sohhyung; Cho, Chanhee; Park, Hyeonchu; Kim, Bugeun
author_facet	Choi, Junhyuk; Park, Sohhyung; Cho, Chanhee; Park, Hyeonchu; Kim, Bugeun
contents	While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00521
institution	arXiv
publishDate	2026
record_format	arxiv
title	Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.00521
