| Main Author: | Zhang, Xinran |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.28005 |
Similar Items
Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation
by: Lee, Dongryeol, et al.
Published: (2026)
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)
How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
by: Zhang, Xinran
Published: (2026)
Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition
by: Sheth, Ivaxi, et al.
Published: (2026)
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
by: Bellibatlu, Rohith Reddy, et al.
Published: (2026)
Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation
by: Adib, Shefayat E Shams, et al.
Published: (2026)
Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
by: Elganayni, Mohamed Hesham, et al.
Published: (2026)
Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses
by: Ho, Xanh, et al.
Published: (2025)
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
by: Wei, Hui, et al.
Published: (2024)
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
by: Badshah, Sher, et al.
Published: (2024)
Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics
by: Hua, Yilun, et al.
Published: (2026)
Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations
by: Sultan, Md Arafat, et al.
Published: (2024)
CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement
by: Zhang, Wentao, et al.
Published: (2025)
Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
by: Xie, Wenwen, et al.
Published: (2025)
Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
by: Song, Mingyang, et al.
Published: (2026)
Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation
by: Lin, Wei-Hsiang, et al.
Published: (2025)
Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models
by: Le, Hieu Xuan, et al.
Published: (2026)
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
by: Huang, Hui, et al.
Published: (2024)
From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges
by: Hong, Yihan, et al.
Published: (2026)
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
by: Zhang, Qiyuan, et al.
Published: (2024)
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
by: Gisserot-Boukhlef, Hippolyte, et al.
Published: (2026)
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
by: Zhu, Ziyi, et al.
Published: (2026)
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
by: Krumdick, Michael, et al.
Published: (2025)
Same Content, Different Representations: A Controlled Study for Table QA
by: Zhang, Yue, et al.
Published: (2025)
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
by: Zhou, Yilun, et al.
Published: (2025)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks
by: Maloyan, Narek, et al.
Published: (2025)
Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)
Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
by: Li, Zhuochun, et al.
Published: (2026)
Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory
by: Sun, Jingwei, et al.
Published: (2026)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
SEC-QA: A Systematic Evaluation Corpus for Financial QA
by: Lai, Viet Dac, et al.
Published: (2024)
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
by: Chen, Junjie, et al.
Published: (2026)
BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA
by: Jonker, Richard A. A., et al.
Published: (2026)
JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction
by: Zhan, Yuhao, et al.
Published: (2025)
Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking
by: Zhang, Yichi, et al.
Published: (2024)
An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability
by: Yamauchi, Yusuke, et al.
Published: (2025)
sebis at ArchEHR-QA 2026: How Much Can You Do Locally? Evaluating Grounded EHR QA on a Single Notebook
by: Yurt, Ibrahim Ebrar, et al.
Published: (2026)
Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering
by: Bogireddy, Sai Prasanna Teja Reddy, et al.
Published: (2025)