| Main Authors: | Sivalingam, Preetham; Mandal, Murari; Deshpande, Saurabh; Kumar, Dhruv |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.02118 |
Similar Items
Confidence is Not Competence
by: Sanyal, Debdeep, et al.
Published: (2025)
time2time: Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models
by: Sanyal, Debdeep, et al.
Published: (2025)
Agents Are All You Need for LLM Unlearning
by: Sanyal, Debdeep, et al.
Published: (2025)
Measuring Representation Robustness in Large Language Models for Geometry
by: Jawandhia, Vedant, et al.
Published: (2026)
ReviewEval: An Evaluation Framework for AI-Generated Reviews
by: Garg, Madhav Krishan, et al.
Published: (2025)
Policy Optimization Prefers The Path of Least Resistance
by: Sanyal, Debdeep, et al.
Published: (2025)
REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM
by: Jindal, Madhur, et al.
Published: (2025)
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
by: Gupta, Manan, et al.
Published: (2026)
When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection
by: Sahoo, Devanshu, et al.
Published: (2025)
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
by: Agarwal, Parth, et al.
Published: (2025)
UnStar: Unlearning with Self-Taught Anti-Sample Reasoning for LLMs
by: Sinha, Yash, et al.
Published: (2024)
Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations
by: Gajcin, Jasmina, et al.
Published: (2025)
Context Over Content: Exposing Evaluation Faking in Automated Judges
by: Gupta, Manan, et al.
Published: (2026)
The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation
by: Sahoo, Devanshu, et al.
Published: (2026)
Nine Ways to Break Copyright Law and Why Our LLM Won't: A Fair Use Aligned Generation Framework
by: Sharma, Aakash Sen, et al.
Published: (2025)
Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines
by: Saraogi, Devesh, et al.
Published: (2025)
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
by: Wang, Yidong, et al.
Published: (2025)
A Hybrid Supervised-LLM Pipeline for Actionable Suggestion Mining in Unstructured Customer Reviews
by: Trivedi, Aakash, et al.
Published: (2026)
Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering
by: Gupta, Manan, et al.
Published: (2026)
A Survey on LLM-as-a-Judge
by: Gu, Jiawei, et al.
Published: (2024)
SwissNYF: Tool Grounded LLM Agents for Black Box Setting
by: Kumar, Somnath Sendhil, et al.
Published: (2024)
Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions
by: Patil, Parth, et al.
Published: (2026)
BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
by: Tong, Terry, et al.
Published: (2025)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
by: Ye, Jiayi, et al.
Published: (2024)
Are We on the Right Way to Assessing LLM-as-a-Judge?
by: Feng, Yuanning, et al.
Published: (2025)
Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
by: Han, Steve, et al.
Published: (2025)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
by: Liu, Fan, et al.
Published: (2024)
The Language of Bargaining: Linguistic Effects in LLM Negotiations
by: Sinha, Stuti, et al.
Published: (2026)
Support-Contra Asymmetry in LLM Explanations
by: Patil, Avinash
Published: (2025)
Automated Concept Discovery for LLM-as-a-Judge Preference Analysis
by: Wedgwood, James, et al.
Published: (2026)
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
by: Saha, Swarnadeep, et al.
Published: (2025)
Think-J: Learning to Think for Generative LLM-as-a-Judge
by: Huang, Hui, et al.
Published: (2025)
OptiHive: Ensemble Selection for LLM-Based Optimization via Statistical Modeling
by: Bouscary, Maxime, et al.
Published: (2025)
LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation
by: Enguehard, Joseph, et al.
Published: (2025)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows
by: Sharan, Aditya, et al.
Published: (2026)
Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom's Taxonomy
by: Kumar, Ramya, et al.
Published: (2025)