| Main Authors: | Han, Songhee, Shin, Jueun, Han, Jiyoon, Jun, Bung-Woo, Karabatman, Hilal Ayan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.00008 |
Similar Items
Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
by: Han, Steve, et al.
Published: (2025)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
by: Wang, Yidong, et al.
Published: (2025)
From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset
by: Yoo, Haneul, et al.
Published: (2026)
Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation
by: Myung, Jiyoon, et al.
Published: (2024)
Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations
by: Gajcin, Jasmina, et al.
Published: (2025)
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)
EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
by: Mohammadi, Hadi, et al.
Published: (2025)
SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification
by: Yoon, Kanghoon, et al.
Published: (2025)
Why teaching resists automation in an AI-inundated era: Human judgment, non-modular work, and the limits of delegation
by: Han, Songhee
Published: (2026)
SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
by: Park, Sungho, et al.
Published: (2026)
LLM-Based Offline Learning for Embodied Agents via Consistency-Guided Reward Ensemble
by: Lee, Yujeong, et al.
Published: (2024)
Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?
by: Nan, Yang, et al.
Published: (2025)
Leveraging Large Language Models for Generating Labeled Mineral Site Record Linkage Data
by: Pyo, Jiyoon, et al.
Published: (2024)
A Survey on LLM-as-a-Judge
by: Gu, Jiawei, et al.
Published: (2024)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
by: Tong, Terry, et al.
Published: (2025)
Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines
by: Saraogi, Devesh, et al.
Published: (2025)
HyST: LLM-Powered Hybrid Retrieval over Semi-Structured Tabular Data
by: Myung, Jiyoon, et al.
Published: (2025)
The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?
by: Retkowski, Fabian, et al.
Published: (2025)
LLM-as-a-Judge for Time Series Explanations
by: Sivalingam, Preetham, et al.
Published: (2026)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models
by: Fan, Shengda, et al.
Published: (2024)
Deep Literature Survey Automation with an Iterative Workflow
by: Zhang, Hongbo, et al.
Published: (2025)
An LLM + ASP Workflow for Joint Entity-Relation Extraction
by: Tran, Trang, et al.
Published: (2025)
The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows
by: Kim, Hyunwoo, et al.
Published: (2026)
User Perceptions vs. Proxy LLM Judges: Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios
by: Wu, Xiaoyuan, et al.
Published: (2025)
Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research
by: Trott, Sean
Published: (2025)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)
VERT: Reliable LLM Judges for Radiology Report Evaluation
by: Bologna, Federica, et al.
Published: (2026)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
by: Ye, Jiayi, et al.
Published: (2024)
Are We on the Right Way to Assessing LLM-as-a-Judge?
by: Feng, Yuanning, et al.
Published: (2025)
Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
by: Avinash, Karthik, et al.
Published: (2025)
Toward Automated Simulation Research Workflow through LLM Prompt Engineering Design
by: Liu, Zhihan, et al.
Published: (2024)
Affording Process Auditability with QualAnalyzer: An Atomistic LLM Analysis Tool for Qualitative Research
by: Lu, Max Hao, et al.
Published: (2026)
Reference-Free Rating of LLM Responses via Latent Information
by: Girrbach, Leander, et al.
Published: (2025)
Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
by: Reese, May Lynn, et al.
Published: (2026)
Creative Beam Search: LLM-as-a-Judge For Improving Response Generation
by: Franceschelli, Giorgio, et al.
Published: (2024)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)
Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
by: Jiao, Rui, et al.
Published: (2025)
LLM4Sweat: A Trustworthy Large Language Model for Hyperhidrosis Support
by: Lin, Wenjie, et al.
Published: (2025)