| Main Authors: | Zhu, Ziyi, Tieleman, Olivier, Bukhtiyarov, Alexey, Chen, Jinghong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.01865 |
Similar Items
Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)
Quantifying and Mitigating Self-Preference Bias of LLM Judges
by: Yang, Jinming, et al.
Published: (2026)
The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
by: Marioriyad, Arash, et al.
Published: (2025)
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)
Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge
by: Liu, Zhuo, et al.
Published: (2025)
Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization
by: Zhou, Hongli, et al.
Published: (2026)
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
by: Li, Haitao, et al.
Published: (2024)
Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
by: Fujinuma, Yoshinari
Published: (2025)
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
by: Zhu, Ziyi, et al.
Published: (2025)
BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
by: Lai, Peng, et al.
Published: (2026)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck
by: Zhang, Hongbin, et al.
Published: (2026)
Self-Preference Bias in LLM-as-a-Judge
by: Wataoka, Koki, et al.
Published: (2024)
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)
CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
by: Li, Haitao, et al.
Published: (2024)
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
by: Zhou, Yilun, et al.
Published: (2025)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
by: Lee, Sua, et al.
Published: (2026)
FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge
by: Yang, Bo, et al.
Published: (2026)
Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple LLM Judges
by: Tang, Yuqi, et al.
Published: (2025)
Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation
by: Lee, Dongryeol, et al.
Published: (2026)
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
by: Xu, Austin, et al.
Published: (2025)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
by: Tong, Terry, et al.
Published: (2025)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
by: Wang, Yidong, et al.
Published: (2025)
Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge
by: Zhou, Xiaolin, et al.
Published: (2026)
Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges
by: Tapwal, Riya, et al.
Published: (2026)
Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
by: Moon, Jiwon, et al.
Published: (2025)
The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation
by: Marioriyad, Arash, et al.
Published: (2026)
On Evaluating LLM Alignment by Evaluating LLMs as Judges
by: Liu, Yixin, et al.
Published: (2025)
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
by: Chen, Junjie, et al.
Published: (2026)
Quantitative LLM Judges
by: Sahoo, Aishwarya, et al.
Published: (2025)
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
by: Huang, Hui, et al.
Published: (2024)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
by: Bellibatlu, Rohith Reddy, et al.
Published: (2026)
Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
by: Liu, Shuliang, et al.
Published: (2025)
MR. Judge: Multimodal Reasoner as a Judge
by: Pi, Renjie, et al.
Published: (2025)
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
by: Lee, Dongryeol, et al.
Published: (2024)
Assessing Judging Bias in Large Reasoning Models: An Empirical Study
by: Wang, Qian, et al.
Published: (2025)
Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition
by: Sheth, Ivaxi, et al.
Published: (2026)