| Main Authors: | Gao, Yicheng, Xu, Gonghan, Wang, Zhe, Cohan, Arman |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.04424 |
Similar Items
On Evaluating LLM Alignment by Evaluating LLMs as Judges
by: Liu, Yixin, et al.
Published: (2025)
Calibrating Long-form Generations from Large Language Models
by: Huang, Yukun, et al.
Published: (2024)
On the Benefits of Fine-Grained Loss Truncation: A Case Study on Factuality in Summarization
by: Flores, Lorenzo Jaime Yu, et al.
Published: (2024)
Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
by: Lee, Jinu, et al.
Published: (2025)
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
by: Gao, Mingqi, et al.
Published: (2024)
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
by: Zhao, Yilun, et al.
Published: (2025)
SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing
by: Liu, Hongjun, et al.
Published: (2025)
Survey on Evaluation of LLM-based Agents
by: Yehudai, Asaf, et al.
Published: (2025)
References Improve LLM Alignment in Non-Verifiable Domains
by: Shi, Kejian, et al.
Published: (2026)
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
by: Li, Chuhan, et al.
Published: (2024)
IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
by: Garikaparthi, Aniketh, et al.
Published: (2025)
From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations
by: Wang, Benlu, et al.
Published: (2025)
Investigating Data Contamination in Modern Benchmarks for Large Language Models
by: Deng, Chunyuan, et al.
Published: (2023)
Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future
by: Wu, Sihong, et al.
Published: (2026)
LocAgent: Graph-Guided LLM Agents for Code Localization
by: Chen, Zhaoling, et al.
Published: (2025)
MIR: Methodology Inspiration Retrieval for Scientific Research Problems
by: Garikaparthi, Aniketh, et al.
Published: (2025)
ReIFE: Re-evaluating Instruction-Following Evaluation
by: Liu, Yixin, et al.
Published: (2024)
MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise
by: Deng, Chunyuan, et al.
Published: (2024)
SciMDR: Advancing Scientific Multimodal Document Reasoning
by: Chen, Ziyu, et al.
Published: (2026)
RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
by: Wu, Sihong, et al.
Published: (2026)
ToolACE: Winning the Points of LLM Function Calling
by: Liu, Weiwen, et al.
Published: (2024)
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
by: Liu, Yixin, et al.
Published: (2026)
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
by: Zheng, Xiaosen, et al.
Published: (2024)
COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences
by: Liu, Yixin, et al.
Published: (2024)
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2023)
ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
by: Zhao, Haochen, et al.
Published: (2024)
MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments
by: Gao, Yicheng, et al.
Published: (2026)
Step-Back Profiling: Distilling User History for Personalized Scientific Writing
by: Tang, Xiangru, et al.
Published: (2024)
PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
by: Long, Yitao, et al.
Published: (2025)
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
by: Shangguan, Ziyao, et al.
Published: (2024)
When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets
by: Weller, Orion, et al.
Published: (2023)
Graphical Reasoning: LLM-based Semi-Open Relation Extraction
by: Tao, Yicheng, et al.
Published: (2024)
MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants
by: Zhang, Zeyu, et al.
Published: (2024)
Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge
by: Zhou, Xiaolin, et al.
Published: (2026)
P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains
by: Han, Simeng, et al.
Published: (2024)
GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
by: Zhao, Junjie, et al.
Published: (2026)
Mini-Giants: "Small" Language Models and Open Source Win-Win
by: Zhou, Zhengping, et al.
Published: (2023)
Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
by: Dekoninck, Jasper, et al.
Published: (2024)
SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation
by: Liu, Xiaoze, et al.
Published: (2024)
When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training
by: Chen, Sanxing, et al.
Published: (2025)