:: Library Catalog

Bibliographic Details
Main Authors:	Hossain, Akram; Abdelfattah, Rabab; Wang, Xiaofeng; Abdelfatah, Kareem
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.05371

Similar Items

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
by: Soumik, Sadman Kabir
Published: (2026)

BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
by: Tong, Terry, et al.
Published: (2025)

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
by: Dev, Sunishchal, et al.
Published: (2026)

Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems
by: Li, Xiaochuan, et al.
Published: (2025)

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
by: Wang, Yidong, et al.
Published: (2025)

Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
by: Li, Zhuochun, et al.
Published: (2026)

JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)

MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation
by: Wang, Yutong, et al.
Published: (2025)

A Survey on LLM-as-a-Judge
by: Gu, Jiawei, et al.
Published: (2024)

CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)

Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
by: Han, Steve, et al.
Published: (2025)

Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)

Auto-Prompt Ensemble for LLM Judge
by: Li, Jiajie, et al.
Published: (2025)

To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay
by: Dey, Soumik, et al.
Published: (2025)

JudgeLRM: Large Reasoning Models as a Judge
by: Chen, Nuo, et al.
Published: (2025)

Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K-12 Science Instructional Materials
by: He, Peng, et al.
Published: (2026)

JudgeLM: Fine-tuned Large Language Models are Scalable Judges
by: Zhu, Lianghui, et al.
Published: (2023)

JudgeFlow: Agentic Workflow Optimization via Block Judge
by: Ma, Zihan, et al.
Published: (2026)

Multi-Agent Debate for LLM Judges with Adaptive Stability Detection
by: Hu, Tianyu, et al.
Published: (2025)

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
by: Chen, Dongping, et al.
Published: (2024)

Are We on the Right Way to Assessing LLM-as-a-Judge?
by: Feng, Yuanning, et al.
Published: (2025)

Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
by: Tian, Zailong, et al.
Published: (2025)

Who's Your Judge? On the Detectability of LLM-Generated Judgments
by: Li, Dawei, et al.
Published: (2025)

LLM-as-a-Judge for Time Series Explanations
by: Sivalingam, Preetham, et al.
Published: (2026)

UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge
by: Zhang, Yang, et al.
Published: (2025)

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
by: Koo, Hamin, et al.
Published: (2025)

Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
by: Xu, Austin, et al.
Published: (2025)

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
by: Ye, Jiayi, et al.
Published: (2024)

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
by: Duo, Jiangshan, et al.
Published: (2026)

PentestJudge: Judging Agent Behavior Against Operational Requirements
by: Caldwell, Shane, et al.
Published: (2025)

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
by: Yuan, Tongxin, et al.
Published: (2024)

When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs
by: Yu, Fangyi
Published: (2025)

Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge
by: Pan, Bo, et al.
Published: (2026)

Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations
by: Jain, Suryaansh, et al.
Published: (2025)

Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge
by: Ma, Chiyu, et al.
Published: (2025)

Think-J: Learning to Think for Generative LLM-as-a-Judge
by: Huang, Hui, et al.
Published: (2025)

VERT: Reliable LLM Judges for Radiology Report Evaluation
by: Bologna, Federica, et al.
Published: (2026)