| Main Authors: | Lee, Ayoung, Kwon, Ryan Sungmo, Railton, Peter, Wang, Lu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.10823 |
Similar Items
JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation
by: Bi, Zhenyu, et al.
Published: (2025)
Implicit Behavioral Alignment of Language Agents in High-Stakes Crowd Simulations
by: Wang, Yunzhe, et al.
Published: (2025)
SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification
by: Yoon, Kanghoon, et al.
Published: (2025)
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
by: Zhu, Lianghui, et al.
Published: (2023)
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation
by: Khalifa, Muhammad, et al.
Published: (2026)
StakeBench: Evaluating Language Understanding Grounded in Market Commitment
by: Pei, Yunhua, et al.
Published: (2026)
Understanding the Dilemma of Unlearning for Large Language Models
by: Zhang, Qingjie, et al.
Published: (2025)
LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
by: Zou, Kaijian, et al.
Published: (2025)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)
Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
by: Zhang, Yichi, et al.
Published: (2026)
Who is a Better Matchmaker? Human vs. Algorithmic Judge Assignment in a High-Stakes Startup Competition
by: Xi, Sarina, et al.
Published: (2025)
Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant
by: Lee, Jemin, et al.
Published: (2024)
A Distributional Perspective on Word Learning in Neural Language Models
by: Ficarra, Filippo, et al.
Published: (2025)
Is LLM an Overconfident Judge? Unveiling the Capabilities of LLMs in Detecting Offensive Language with Annotation Disagreement
by: Lu, Junyu, et al.
Published: (2025)
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
by: Zhang, Bang, et al.
Published: (2025)
JudgeLRM: Large Reasoning Models as a Judge
by: Chen, Nuo, et al.
Published: (2025)
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
by: Lee, Sua, et al.
Published: (2026)
Context Over Content: Exposing Evaluation Faking in Automated Judges
by: Gupta, Manan, et al.
Published: (2026)
Logit Arithmetic Elicits Long Reasoning Capabilities Without Training
by: Zhang, Yunxiang, et al.
Published: (2025)
EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
by: Mohammadi, Hadi, et al.
Published: (2025)
M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models
by: Kwon, Yejin, et al.
Published: (2025)
Can Vision Language Models Judge Action Quality? An Empirical Evaluation
by: Freitas, Miguel Monte e, et al.
Published: (2026)
Identifying Multiple Personalities in Large Language Models with External Evaluation
by: Song, Xiaoyang, et al.
Published: (2024)
Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models
by: Kumar, Shachi H, et al.
Published: (2024)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
Audio-Aware Large Language Models as Judges for Speaking Styles
by: Chiang, Cheng-Han, et al.
Published: (2025)
Evaluating the Retrieval Robustness of Large Language Models
by: Cao, Shuyang, et al.
Published: (2025)
On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions
by: Nguyen, Dang, et al.
Published: (2025)
Meta-Judging with Large Language Models: Concepts, Methods, and Challenges
by: Silva, Hugo, et al.
Published: (2026)
GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation
by: Lee, Jeongsoo, et al.
Published: (2025)
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)
Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective
by: Kou, Zhiqiang, et al.
Published: (2025)
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
by: Cao, Maosong, et al.
Published: (2024)
Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
by: Wei, Xiaolong, et al.
Published: (2025)
Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas: A Survey
by: Deng, Chengyuan, et al.
Published: (2024)
A Perspective on Large Language Models, Intelligent Machines, and Knowledge Acquisition
by: Cherkassky, Vladimir, et al.
Published: (2024)
Evaluating Language Models' Evaluations of Games
by: Collins, Katherine M., et al.
Published: (2025)
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
by: Lee, Jaeyun, et al.
Published: (2026)
JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking
by: Niu, Tong, et al.
Published: (2024)