:: Library Catalog

Bibliographic Details
Main Author:	Lau, Fiona
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.04417

Similar Items

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
by: Kostić, Bogdan, et al.
Published: (2026)

Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks
by: Haldar, Rajarshi, et al.
Published: (2025)

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
by: Wang, Yidong, et al.
Published: (2025)

LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories
by: Vishnubhotla, Krishnapriya, et al.
Published: (2026)

Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)

Different Demographic Cues Yield Inconsistent Conclusions About LLM Personalization and Bias
by: Tonneau, Manuel, et al.
Published: (2026)

Same Content, Different Representations: A Controlled Study for Table QA
by: Zhang, Yue, et al.
Published: (2025)

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering
by: Molfese, Francesco Maria, et al.
Published: (2025)

Personalized LLM for Generating Customized Responses to the Same Query from Different Users
by: Zeng, Hang, et al.
Published: (2024)

The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling
by: Zhang, Ruochen, et al.
Published: (2024)

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)

AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
by: Zhou, Karen, et al.
Published: (2026)

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
by: Tang, Zhenwei, et al.
Published: (2026)

Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
by: Fujinuma, Yoshinari
Published: (2025)

Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing
by: Ahn, Jihyun Janice, et al.
Published: (2025)

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
by: Huang, Hui, et al.
Published: (2024)

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models
by: Lin, Zizhuo, et al.
Published: (2026)

Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport
by: Morandi, Andrea
Published: (2026)

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
by: Levy, Mosh, et al.
Published: (2024)

Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?
by: Barnes, Jeremy, et al.
Published: (2025)

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges
by: Hong, Yihan, et al.
Published: (2026)

When LLM Judge Scores Look Good but Best-of-N Decisions Fail
by: Landesberg, Eddie
Published: (2026)

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
by: Li, Yubo, et al.
Published: (2026)

Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection
by: Sagimbayeva, Nursulu, et al.
Published: (2025)

Evaluating LLM-Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction
by: Kamoi, Ryo, et al.
Published: (2026)

FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge
by: Yang, Bo, et al.
Published: (2026)

The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
by: Marioriyad, Arash, et al.
Published: (2025)

EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models
by: Su, Jiamin, et al.
Published: (2025)

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
by: Bellibatlu, Rohith Reddy, et al.
Published: (2026)

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
by: Alkaeed, Mahdi, et al.
Published: (2026)

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)

AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling
by: Miao, Yongliang, et al.
Published: (2026)

Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
by: Wu, Junjie, et al.
Published: (2026)

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs
by: Ford, Casey, et al.
Published: (2026)

AmbigDocs: Reasoning across Documents on Different Entities under the Same Name
by: Lee, Yoonsang, et al.
Published: (2024)

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
by: Zhu, Ziyi, et al.
Published: (2026)

Quantitative LLM Judges
by: Sahoo, Aishwarya, et al.
Published: (2025)

Judging It, Washing It: Scoring and Greenwashing Corporate Climate Disclosures using Large Language Models
by: Chuang, Marianne, et al.
Published: (2025)

Inconsistencies in Masked Language Models
by: Young, Tom, et al.
Published: (2022)

Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization
by: Ma, Jing
Published: (2025)