| Main Author: | Lau, Fiona |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.04417 |
Similar Items
Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
by: Kostić, Bogdan, et al.
Published: (2026)
Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks
by: Haldar, Rajarshi, et al.
Published: (2025)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
by: Wang, Yidong, et al.
Published: (2025)
LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories
by: Vishnubhotla, Krishnapriya, et al.
Published: (2026)
Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)
Different Demographic Cues Yield Inconsistent Conclusions About LLM Personalization and Bias
by: Tonneau, Manuel, et al.
Published: (2026)
Same Content, Different Representations: A Controlled Study for Table QA
by: Zhang, Yue, et al.
Published: (2025)
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering
by: Molfese, Francesco Maria, et al.
Published: (2025)
Personalized LLM for Generating Customized Responses to the Same Query from Different Users
by: Zeng, Hang, et al.
Published: (2024)
The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling
by: Zhang, Ruochen, et al.
Published: (2024)
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)
AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
by: Zhou, Karen, et al.
Published: (2026)
RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
by: Tang, Zhenwei, et al.
Published: (2026)
Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
by: Fujinuma, Yoshinari
Published: (2025)
Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing
by: Ahn, Jihyun Janice, et al.
Published: (2025)
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
by: Huang, Hui, et al.
Published: (2024)
Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models
by: Lin, Zizhuo, et al.
Published: (2026)
Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport
by: Morandi, Andrea
Published: (2026)
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
by: Levy, Mosh, et al.
Published: (2024)
Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?
by: Barnes, Jeremy, et al.
Published: (2025)
From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges
by: Hong, Yihan, et al.
Published: (2026)
When LLM Judge Scores Look Good but Best-of-N Decisions Fail
by: Landesberg, Eddie
Published: (2026)
Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
by: Li, Yubo, et al.
Published: (2026)
Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection
by: Sagimbayeva, Nursulu, et al.
Published: (2025)
Evaluating LLM-Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction
by: Kamoi, Ryo, et al.
Published: (2026)
FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge
by: Yang, Bo, et al.
Published: (2026)
The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
by: Marioriyad, Arash, et al.
Published: (2025)
EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models
by: Su, Jiamin, et al.
Published: (2025)
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
by: Bellibatlu, Rohith Reddy, et al.
Published: (2026)
Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
by: Alkaeed, Mahdi, et al.
Published: (2026)
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)
AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling
by: Miao, Yongliang, et al.
Published: (2026)
Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
by: Wu, Junjie, et al.
Published: (2026)
Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs
by: Ford, Casey, et al.
Published: (2026)
AmbigDocs: Reasoning across Documents on Different Entities under the Same Name
by: Lee, Yoonsang, et al.
Published: (2024)
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
by: Zhu, Ziyi, et al.
Published: (2026)
Quantitative LLM Judges
by: Sahoo, Aishwarya, et al.
Published: (2025)
Judging It, Washing It: Scoring and Greenwashing Corporate Climate Disclosures using Large Language Models
by: Chuang, Marianne, et al.
Published: (2025)
Inconsistencies in Masked Language Models
by: Young, Tom, et al.
Published: (2022)
Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization
by: Ma, Jing
Published: (2025)