Library Catalog

Bibliographic Details
Main Authors:	Gao, Yicheng; Xu, Gonghan; Wang, Zhe; Cohan, Arman
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2411.04424

Similar Items

On Evaluating LLM Alignment by Evaluating LLMs as Judges
by: Liu, Yixin, et al.
Published: (2025)

Calibrating Long-form Generations from Large Language Models
by: Huang, Yukun, et al.
Published: (2024)

On the Benefits of Fine-Grained Loss Truncation: A Case Study on Factuality in Summarization
by: Flores, Lorenzo Jaime Yu, et al.
Published: (2024)

Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
by: Lee, Jinu, et al.
Published: (2025)

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
by: Gao, Mingqi, et al.
Published: (2024)

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
by: Zhao, Yilun, et al.
Published: (2025)

SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing
by: Liu, Hongjun, et al.
Published: (2025)

Survey on Evaluation of LLM-based Agents
by: Yehudai, Asaf, et al.
Published: (2025)

References Improve LLM Alignment in Non-Verifiable Domains
by: Shi, Kejian, et al.
Published: (2026)

M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
by: Li, Chuhan, et al.
Published: (2024)

IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
by: Garikaparthi, Aniketh, et al.
Published: (2025)

From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations
by: Wang, Benlu, et al.
Published: (2025)

Investigating Data Contamination in Modern Benchmarks for Large Language Models
by: Deng, Chunyuan, et al.
Published: (2023)

Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future
by: Wu, Sihong, et al.
Published: (2026)

LocAgent: Graph-Guided LLM Agents for Code Localization
by: Chen, Zhaoling, et al.
Published: (2025)

MIR: Methodology Inspiration Retrieval for Scientific Research Problems
by: Garikaparthi, Aniketh, et al.
Published: (2025)

ReIFE: Re-evaluating Instruction-Following Evaluation
by: Liu, Yixin, et al.
Published: (2024)

MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise
by: Deng, Chunyuan, et al.
Published: (2024)

SciMDR: Advancing Scientific Multimodal Document Reasoning
by: Chen, Ziyu, et al.
Published: (2026)

RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
by: Wu, Sihong, et al.
Published: (2026)

ToolACE: Winning the Points of LLM Function Calling
by: Liu, Weiwen, et al.
Published: (2024)

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
by: Liu, Yixin, et al.
Published: (2026)

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
by: Zheng, Xiaosen, et al.
Published: (2024)

COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences
by: Liu, Yixin, et al.
Published: (2024)

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2023)

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
by: Zhao, Haochen, et al.
Published: (2024)

MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments
by: Gao, Yicheng, et al.
Published: (2026)

Step-Back Profiling: Distilling User History for Personalized Scientific Writing
by: Tang, Xiangru, et al.
Published: (2024)

PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
by: Long, Yitao, et al.
Published: (2025)

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
by: Shangguan, Ziyao, et al.
Published: (2024)

When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets
by: Weller, Orion, et al.
Published: (2023)

Graphical Reasoning: LLM-based Semi-Open Relation Extraction
by: Tao, Yicheng, et al.
Published: (2024)

MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants
by: Zhang, Zeyu, et al.
Published: (2024)

Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge
by: Zhou, Xiaolin, et al.
Published: (2026)

P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains
by: Han, Simeng, et al.
Published: (2024)

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
by: Zhao, Junjie, et al.
Published: (2026)

Mini-Giants: "Small" Language Models and Open Source Win-Win
by: Zhou, Zhengping, et al.
Published: (2023)

Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
by: Dekoninck, Jasper, et al.
Published: (2024)

SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation
by: Liu, Xiaoze, et al.
Published: (2024)

When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training
by: Chen, Sanxing, et al.
Published: (2025)