Staff View: Agent-as-a-Judge :: Library Catalog

Bibliographic Details
Main Authors:	You, Runyang; Cai, Hongru; Zhang, Caiqi; Xu, Qiancheng; Liu, Meng; Yu, Tiezheng; Li, Yongqi; Li, Wenjie
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.05111

_version_	1866908754048450560
author	You, Runyang; Cai, Hongru; Zhang, Caiqi; Xu, Qiancheng; Liu, Meng; Yu, Tiezheng; Li, Yongqi; Li, Wenjie
author_facet	You, Runyang; Cai, Hongru; Zhang, Caiqi; Xu, Qiancheng; Liu, Meng; Yu, Tiezheng; Li, Yongqi; Li, Wenjie
contents	LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_05111
institution	arXiv
publishDate	2026
record_format	arxiv
title	Agent-as-a-Judge
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2601.05111

Similar Items