| Main Authors: | Sheng, Huanxin, Liu, Xinyi, He, Hangfeng, Zhao, Jieyu, Kang, Jian |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.18658 |
Similar Items
The Role of Model Confidence on Bias Effects in Measured Uncertainties for Vision-Language Models
by: Liu, Xinyi, et al.
Published: (2025)
An Empirical Analysis on Large Language Models in Debate Evaluation
by: Liu, Xinyi, et al.
Published: (2024)
Video-Based Reward Modeling for Computer-Use Agents
by: Song, Linxin, et al.
Published: (2026)
A Law of Next-Token Prediction in Large Language Models
by: He, Hangfeng, et al.
Published: (2024)
SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation
by: He, Hangfeng, et al.
Published: (2023)
TreeRare: Syntax Tree-Guided Retrieval and Reasoning for Knowledge-Intensive Question Answering
by: Zhang, Boyi, et al.
Published: (2025)
Same Company, Same Signal: The Role of Identity in Earnings Call Transcripts
by: Yu, Ding, et al.
Published: (2024)
Unveiling Divergent Inductive Biases of LLMs on Temporal Data
by: Kishore, Sindhu, et al.
Published: (2024)
On the Role of Model Prior in Real-World Inductive Reasoning
by: Liu, Zhuo, et al.
Published: (2024)
Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting
by: Wu, Jiarui, et al.
Published: (2025)
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
by: Gupta, Manan, et al.
Published: (2026)
Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation
by: Lee, Dongryeol, et al.
Published: (2026)
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
by: Huang, Hui, et al.
Published: (2024)
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
by: Lee, Dongryeol, et al.
Published: (2024)
Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches
by: Zhou, Yuhang, et al.
Published: (2025)
Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
by: Moon, Jiwon, et al.
Published: (2025)
Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base
by: Song, Linxin, et al.
Published: (2025)
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
by: Wei, Hui, et al.
Published: (2024)
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)
On Evaluating LLM Alignment by Evaluating LLMs as Judges
by: Liu, Yixin, et al.
Published: (2025)
PANDA -- Paired Anti-hate Narratives Dataset from Asia: Using an LLM-as-a-Judge to Create the First Chinese Counterspeech Dataset
by: Bennie, Michael, et al.
Published: (2025)
Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
by: Chen, Junjie, et al.
Published: (2026)
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
by: Zhu, Ziyi, et al.
Published: (2026)
Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation
by: Lin, Wei-Hsiang, et al.
Published: (2025)
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
by: Zhou, Yilun, et al.
Published: (2025)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
by: Tong, Terry, et al.
Published: (2025)
A Survey on LLM-as-a-Judge
by: Gu, Jiawei, et al.
Published: (2024)
One Token to Fool LLM-as-a-Judge
by: Zhao, Yulai, et al.
Published: (2025)
How Reliable is Multilingual LLM-as-a-Judge?
by: Fu, Xiyan, et al.
Published: (2025)
COPU: Conformal Prediction for Uncertainty Quantification in Natural Language Generation
by: Wang, Sean, et al.
Published: (2025)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
by: Lai, Peng, et al.
Published: (2026)
Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization
by: Zhou, Hongli, et al.
Published: (2026)
Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge
by: Zhang, Qiyuan, et al.
Published: (2025)
Think-J: Learning to Think for Generative LLM-as-a-Judge
by: Huang, Hui, et al.
Published: (2025)
The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
by: Marioriyad, Arash, et al.
Published: (2025)
FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge
by: Yang, Bo, et al.
Published: (2026)
How to Correctly Report LLM-as-a-Judge Evaluations
by: Lee, Chungpa, et al.
Published: (2025)