| Main Authors: | Li, Xiaochuan, Wang, Ke, Gouda, Girija, Choudhary, Shubham, Wang, Yaqun, Hu, Linwei, Vaughan, Joel, Lecue, Freddy |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.01786 |
Similar Items
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
by: Verga, Pat, et al.
Published: (2024)
Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations
by: Jain, Suryaansh, et al.
Published: (2025)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)
Who's Your Judge? On the Detectability of LLM-Generated Judgments
by: Li, Dawei, et al.
Published: (2025)
MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation
by: Wang, Yutong, et al.
Published: (2025)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection
by: Hossain, Akram, et al.
Published: (2026)
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
by: Soumik, Sadman Kabir
Published: (2026)
The Effect of Data Poisoning on Counterfactual Explanations
by: Artelt, André, et al.
Published: (2024)
BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
by: Tong, Terry, et al.
Published: (2025)
Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
by: Reese, May Lynn, et al.
Published: (2026)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
by: Wang, Yidong, et al.
Published: (2025)
Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
by: Dev, Sunishchal, et al.
Published: (2026)
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)
On Evaluating LLM Alignment by Evaluating LLMs as Judges
by: Liu, Yixin, et al.
Published: (2025)
Multi-Agent Debate for LLM Judges with Adaptive Stability Detection
by: Hu, Tianyu, et al.
Published: (2025)
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)
Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K-12 Science Instructional Materials
by: He, Peng, et al.
Published: (2026)
How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows
by: Han, Songhee, et al.
Published: (2026)
A Survey on LLM-as-a-Judge
by: Gu, Jiawei, et al.
Published: (2024)
VERT: Reliable LLM Judges for Radiology Report Evaluation
by: Bologna, Federica, et al.
Published: (2026)
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
by: Fabbri, Francesco, et al.
Published: (2025)
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
by: Xu, Austin, et al.
Published: (2025)
Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
by: Chhabra, Mukul, et al.
Published: (2026)
Auto-Prompt Ensemble for LLM Judge
by: Li, Jiajie, et al.
Published: (2025)
Are We on the Right Way to Assessing LLM-as-a-Judge?
by: Feng, Yuanning, et al.
Published: (2025)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)
Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation
by: Huang, Tzu-Heng, et al.
Published: (2025)
Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation
by: Feuer, Benjamin, et al.
Published: (2026)
BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
by: Lai, Peng, et al.
Published: (2026)
UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge
by: Zhang, Yang, et al.
Published: (2025)
LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation
by: Enguehard, Joseph, et al.
Published: (2025)
JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation
by: Shi, Zhichao, et al.
Published: (2025)
Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
by: Tian, Zailong, et al.
Published: (2025)
LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost
by: Huang, Donghao, et al.
Published: (2025)
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
by: Koo, Hamin, et al.
Published: (2025)
Routing to the Right Expertise: A Trustworthy Judge for Instruction-based Image Editing
by: Sun, Chenxi, et al.
Published: (2025)
Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
by: Han, Steve, et al.
Published: (2025)
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
by: Sun, Bian, et al.
Published: (2026)