:: Library Catalog

Bibliographic Details
Main Author:	Zhang, Xinran
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.28005

Similar Items

Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation
by: Lee, Dongryeol, et al.
Published: (2026)

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
by: Zhang, Xinran
Published: (2026)

Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition
by: Sheth, Ivaxi, et al.
Published: (2026)

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
by: Bellibatlu, Rohith Reddy, et al.
Published: (2026)

Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation
by: Adib, Shefayat E Shams, et al.
Published: (2026)

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
by: Elganayni, Mohamed Hesham, et al.
Published: (2026)

Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses
by: Ho, Xanh, et al.
Published: (2025)

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
by: Wei, Hui, et al.
Published: (2024)

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
by: Badshah, Sher, et al.
Published: (2024)

Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics
by: Hua, Yilun, et al.
Published: (2026)

Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations
by: Sultan, Md Arafat, et al.
Published: (2024)

CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement
by: Zhang, Wentao, et al.
Published: (2025)

Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
by: Xie, Wenwen, et al.
Published: (2025)

Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
by: Song, Mingyang, et al.
Published: (2026)

Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation
by: Lin, Wei-Hsiang, et al.
Published: (2025)

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models
by: Le, Hieu Xuan, et al.
Published: (2026)

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
by: Huang, Hui, et al.
Published: (2024)

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges
by: Hong, Yihan, et al.
Published: (2026)

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)

RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
by: Zhang, Qiyuan, et al.
Published: (2024)

BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
by: Gisserot-Boukhlef, Hippolyte, et al.
Published: (2026)

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
by: Zhu, Ziyi, et al.
Published: (2026)

No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
by: Krumdick, Michael, et al.
Published: (2025)

Same Content, Different Representations: A Controlled Study for Table QA
by: Zhang, Yue, et al.
Published: (2025)

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
by: Zhou, Yilun, et al.
Published: (2025)

JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)

Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks
by: Maloyan, Narek, et al.
Published: (2025)

Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)

Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
by: Li, Zhuochun, et al.
Published: (2026)

Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory
by: Sun, Jingwei, et al.
Published: (2026)

Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)

SEC-QA: A Systematic Evaluation Corpus for Financial QA
by: Lai, Viet Dac, et al.
Published: (2024)

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
by: Chen, Junjie, et al.
Published: (2026)

BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA
by: Jonker, Richard A. A., et al.
Published: (2026)

JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction
by: Zhan, Yuhao, et al.
Published: (2025)

Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking
by: Zhang, Yichi, et al.
Published: (2024)

An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability
by: Yamauchi, Yusuke, et al.
Published: (2025)

sebis at ArchEHR-QA 2026: How Much Can You Do Locally? Evaluating Grounded EHR QA on a Single Notebook
by: Yurt, Ibrahim Ebrar, et al.
Published: (2026)

Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering
by: Bogireddy, Sai Prasanna Teja Reddy, et al.
Published: (2025)