| Main Authors: | Dev, Sunishchal, Sloan, Andrew, Kavner, Joshua, Kong, Nicholas, Sandler, Morgan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.05399 |
Similar Items
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
by: Weng, Shihao, et al.
Published: (2026)
VERT: Reliable LLM Judges for Radiology Report Evaluation
by: Bologna, Federica, et al.
Published: (2026)
LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost
by: Huang, Donghao, et al.
Published: (2025)
Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory
by: Choi, Junhyuk, et al.
Published: (2026)
Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation
by: Radharapu, Bhaktipriya, et al.
Published: (2025)
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
by: Gupta, Manan, et al.
Published: (2026)
From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges
by: Hong, Yihan, et al.
Published: (2026)
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
by: Schwinn, Leo, et al.
Published: (2026)
Is Your Video Language Model a Reliable Judge?
by: Liu, Ming, et al.
Published: (2025)
Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
by: Zhang, Xingjian, et al.
Published: (2025)
MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation
by: Wang, Yutong, et al.
Published: (2025)
Average-Case Analysis of Iterative Voting
by: Kavner, Joshua, et al.
Published: (2024)
Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
by: Egbuna, Nathan, et al.
Published: (2025)
Enabling Weak LLMs to Judge Response Reliability via Meta Ranking
by: Liu, Zijun, et al.
Published: (2024)
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
by: Soumik, Sadman Kabir
Published: (2026)
LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection
by: Hossain, Akram, et al.
Published: (2026)
BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
by: Tong, Terry, et al.
Published: (2025)
ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
by: Owiredu-Ashley, Harry
Published: (2026)
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
by: Jin, Jiho, et al.
Published: (2026)
Assessing the Reliability of Large Language Models in the Bengali Legal Context: A Comparative Evaluation Using LLM-as-Judge and Legal Experts
by: Aftahee, Sabik, et al.
Published: (2025)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
by: Wang, Yidong, et al.
Published: (2025)
Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems
by: Li, Xiaochuan, et al.
Published: (2025)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)
Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering
by: Mao, Nathan, et al.
Published: (2026)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)
Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
by: Han, Steve, et al.
Published: (2025)
A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test
by: Sartori, Camilo Chacón, et al.
Published: (2026)
A Survey on LLM-as-a-Judge
by: Gu, Jiawei, et al.
Published: (2024)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
Auto-Prompt Ensemble for LLM Judge
by: Li, Jiajie, et al.
Published: (2025)
To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay
by: Dey, Soumik, et al.
Published: (2025)
JudgeFlow: Agentic Workflow Optimization via Block Judge
by: Ma, Zihan, et al.
Published: (2026)
Theorem Prover as a Judge for Synthetic Data Generation
by: Leang, Joshua Ong Jun, et al.
Published: (2025)
Who's Your Judge? On the Detectability of LLM-Generated Judgments
by: Li, Dawei, et al.
Published: (2025)
LLM-as-a-Judge for Time Series Explanations
by: Sivalingam, Preetham, et al.
Published: (2026)
JudgeLRM: Large Reasoning Models as a Judge
by: Chen, Nuo, et al.
Published: (2025)
Limits of Emergent Reasoning of Large Language Models in Agentic Frameworks for Deterministic Games
by: Su, Chris, et al.
Published: (2025)
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
by: Koo, Hamin, et al.
Published: (2025)