:: Library Catalog

Bibliographic Details
Main Authors:	Han, Songhee; Shin, Jueun; Han, Jiyoon; Jun, Bung-Woo; Karabatman, Hilal Ayan
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.00008

Similar Items

Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
by: Han, Steve, et al.
Published: (2025)

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
by: Wang, Yidong, et al.
Published: (2025)

From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset
by: Yoo, Haneul, et al.
Published: (2026)

Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation
by: Myung, Jiyoon, et al.
Published: (2024)

Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations
by: Gajcin, Jasmina, et al.
Published: (2025)

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)

EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
by: Mohammadi, Hadi, et al.
Published: (2025)

SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification
by: Yoon, Kanghoon, et al.
Published: (2025)

Why teaching resists automation in an AI-inundated era: Human judgment, non-modular work, and the limits of delegation
by: Han, Songhee
Published: (2026)

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
by: Park, Sungho, et al.
Published: (2026)

LLM-Based Offline Learning for Embodied Agents via Consistency-Guided Reward Ensemble
by: Lee, Yujeong, et al.
Published: (2024)

Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?
by: Nan, Yang, et al.
Published: (2025)

Leveraging Large Language Models for Generating Labeled Mineral Site Record Linkage Data
by: Pyo, Jiyoon, et al.
Published: (2024)

A Survey on LLM-as-a-Judge
by: Gu, Jiawei, et al.
Published: (2024)

Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)

BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
by: Tong, Terry, et al.
Published: (2025)

Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines
by: Saraogi, Devesh, et al.
Published: (2025)

HyST: LLM-Powered Hybrid Retrieval over Semi-Structured Tabular Data
by: Myung, Jiyoon, et al.
Published: (2025)

The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?
by: Retkowski, Fabian, et al.
Published: (2025)

LLM-as-a-Judge for Time Series Explanations
by: Sivalingam, Preetham, et al.
Published: (2026)

JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models
by: Fan, Shengda, et al.
Published: (2024)

Deep Literature Survey Automation with an Iterative Workflow
by: Zhang, Hongbo, et al.
Published: (2025)

An LLM + ASP Workflow for Joint Entity-Relation Extraction
by: Tran, Trang, et al.
Published: (2025)

The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows
by: Kim, Hyunwoo, et al.
Published: (2026)

User Perceptions vs. Proxy LLM Judges: Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios
by: Wu, Xiaoyuan, et al.
Published: (2025)

Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research
by: Trott, Sean
Published: (2025)

CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)

VERT: Reliable LLM Judges for Radiology Report Evaluation
by: Bologna, Federica, et al.
Published: (2026)

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
by: Ye, Jiayi, et al.
Published: (2024)

Are We on the Right Way to Assessing LLM-as-a-Judge?
by: Feng, Yuanning, et al.
Published: (2025)

Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
by: Avinash, Karthik, et al.
Published: (2025)

Toward Automated Simulation Research Workflow through LLM Prompt Engineering Design
by: Liu, Zhihan, et al.
Published: (2024)

Affording Process Auditability with QualAnalyzer: An Atomistic LLM Analysis Tool for Qualitative Research
by: Lu, Max Hao, et al.
Published: (2026)

Reference-Free Rating of LLM Responses via Latent Information
by: Girrbach, Leander, et al.
Published: (2025)

Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
by: Reese, May Lynn, et al.
Published: (2026)

Creative Beam Search: LLM-as-a-Judge For Improving Response Generation
by: Franceschelli, Giorgio, et al.
Published: (2024)

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)

Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
by: Jiao, Rui, et al.
Published: (2025)

LLM4Sweat: A Trustworthy Large Language Model for Hyperhidrosis Support
by: Lin, Wenjie, et al.
Published: (2025)