Bibliographic Details
Main Authors: You, Runyang, Cai, Hongru, Zhang, Caiqi, Xu, Qiancheng, Liu, Meng, Yu, Tiezheng, Li, Yongqi, Li, Wenjie
Format: Preprint
Published: 2026
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.05111
_version_ 1866908754048450560
author You, Runyang
Cai, Hongru
Zhang, Caiqi
Xu, Qiancheng
Liu, Meng
Yu, Tiezheng
Li, Yongqi
Li, Wenjie
contents LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
format Preprint
id arxiv_https___arxiv_org_abs_2601_05111
institution arXiv
publishDate 2026
record_format arxiv
title Agent-as-a-Judge
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2601.05111