:: Library Catalog

Bibliographic Details
Main Authors:	Li, Qingquan; Dou, Shaoyu; Shao, Kailai; Chen, Chao; Hu, Haixiang
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2506.22316

Similar Items

FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning
by: Dou, Shaoyu, et al.
Published: (2025)

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
by: Zhu, Ziyi, et al.
Published: (2026)

Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
by: Fujinuma, Yoshinari
Published: (2025)

Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge
by: Cai, Yunna, et al.
Published: (2025)

BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
by: Lai, Peng, et al.
Published: (2026)

The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
by: Marioriyad, Arash, et al.
Published: (2025)

Self-Preference Bias in LLM-as-a-Judge
by: Wataoka, Koki, et al.
Published: (2024)

Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization
by: Zhou, Hongli, et al.
Published: (2026)

Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge
by: Zhou, Xiaolin, et al.
Published: (2026)

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)

Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport
by: Morandi, Andrea
Published: (2026)

Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge
by: Liu, Zhuo, et al.
Published: (2025)

Quantifying and Mitigating Self-Preference Bias of LLM Judges
by: Yang, Jinming, et al.
Published: (2026)

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
by: Chen, Junjie, et al.
Published: (2026)

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges
by: Hong, Yihan, et al.
Published: (2026)

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)

FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge
by: Yang, Bo, et al.
Published: (2026)

AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
by: Zhou, Karen, et al.
Published: (2026)

Can LLM be a Personalized Judge?
by: Dong, Yijiang River, et al.
Published: (2024)

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
by: Lee, Sua, et al.
Published: (2026)

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges
by: Tapwal, Riya, et al.
Published: (2026)

Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge
by: Zhang, Qiyuan, et al.
Published: (2025)

Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck
by: Zhang, Hongbin, et al.
Published: (2026)

Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models
by: Liu, Jin, et al.
Published: (2024)

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
by: Zhou, Yilun, et al.
Published: (2025)

Improve LLM-as-a-Judge Ability as a General Ability
by: Yu, Jiachen, et al.
Published: (2025)

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
by: Huang, Hui, et al.
Published: (2024)

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)

Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)

Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge
by: Lau, Fiona
Published: (2026)

Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation
by: Lee, Dongryeol, et al.
Published: (2026)

Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge
by: Spiliopoulou, Evangelia, et al.
Published: (2025)

Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge
by: Xu, Yuzheng, et al.
Published: (2026)

JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
by: Ye, Jiayi, et al.
Published: (2024)

Explaining Length Bias in LLM-Based Preference Evaluations
by: Hu, Zhengyu, et al.
Published: (2024)

On Evaluating LLM Alignment by Evaluating LLMs as Judges
by: Liu, Yixin, et al.
Published: (2025)

BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
by: Tong, Terry, et al.
Published: (2025)

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
by: Li, Haitao, et al.
Published: (2024)

Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple LLM Judges
by: Tang, Yuqi, et al.
Published: (2025)