| Main Authors: | Metcalf, Sara, Schoenberg, William |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.28994 |
Similar Items
How Well Can AI Build SD Models?
by: Schoenberg, William, et al.
Published: (2025)
The Qualitative Engine: Creating and Evaluating an Iterative AI Modeling Tool
by: William Schoenberg, et al.
Published: (2026)
BEExAI: Benchmark to Evaluate Explainable AI
by: Sithakoul, Samuel, et al.
Published: (2024)
Building and Learning With Models Using AI
by: William Schoenberg
Published: (2026)
MDGYM: Benchmarking AI Agents on Molecular Simulations
by: Kumar, Vinay, et al.
Published: (2026)
TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations
by: Kirch, Nathalie Maria, et al.
Published: (2024)
Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines
by: Gulati, Akshay, et al.
Published: (2026)
AI Benchmarks and Datasets for LLM Evaluation
by: Ivanov, Todor, et al.
Published: (2024)
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
by: Glazer, Elliot, et al.
Published: (2024)
PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories
by: Aroca-Ouellette, Stephane, et al.
Published: (2024)
AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems
by: Davvetas, Athanasios, et al.
Published: (2026)
EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
by: Liu, Yi, et al.
Published: (2026)
Evaluation of Agents under Simulated AI Marketplace Dynamics
by: Kim, To Eun, et al.
Published: (2026)
Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows
by: Dalal, Dhairya, et al.
Published: (2026)
Enterprise Large Language Model Evaluation Benchmark
by: Wang, Liya, et al.
Published: (2025)
A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
by: Li, Miles Q., et al.
Published: (2025)
Hindsight PRIORs for Reward Learning from Human Preferences
by: Verma, Mudit, et al.
Published: (2024)
Unsteady Metrics and Benchmarking Cultures of AI Model Builders
by: Baack, Stefan, et al.
Published: (2026)
AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models
by: Yao, Fanglong, et al.
Published: (2024)
AI Playing Business Games: Benchmarking Large Language Models on Managerial Decision-Making in Dynamic Simulations
by: Ovezmyradov, Berdymyrat
Published: (2025)
Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop
by: Fahsbender, Elizabeth, et al.
Published: (2025)
LegalScore: Development of a Benchmark for Evaluating AI Models in Legal Career Exams in Brazil
by: Caparroz, Roberto, et al.
Published: (2025)
The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition
by: Rajkomar, Alvin, et al.
Published: (2026)
SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
by: Yang, Zonglin, et al.
Published: (2026)
Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning
by: Li, Fangjun, et al.
Published: (2024)
Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models II: Benchmark Generation Process
by: Ackerman, Gary, et al.
Published: (2025)
MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis
by: Seyfarth, Marvin, et al.
Published: (2025)
ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines
by: Jin, Tengjun, et al.
Published: (2025)
Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
by: Naser, MZ, et al.
Published: (2026)
Efficient Benchmarking of AI Agents
by: Ndzomga, Franck
Published: (2026)
AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance
by: Marino, Bill, et al.
Published: (2025)
Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning
by: Yuan, Bo, et al.
Published: (2025)
Generating Benchmarks for Factuality Evaluation of Language Models
by: Muhlgay, Dor, et al.
Published: (2023)
Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation
by: Wu, Zonghan, et al.
Published: (2025)
FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
by: May, Victor, et al.
Published: (2025)
Beyond Static Benchmarks: Synthesizing Harmful Content via Persona-based Simulation for Robust Evaluation
by: Lee, Huije, et al.
Published: (2026)
Evaluating AI Alignment in LLMs: Output Analysis of Value Priorities Across 75 Models with Human Benchmarking
by: Lau, Gabriel Rongyang, et al.
Published: (2025)
Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
by: Wang, Jianghui, et al.
Published: (2025)
AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences
by: Li, Jieyu, et al.
Published: (2025)
A Survey on Multimodal Benchmarks: In the Era of Large AI Models
by: Li, Lin, et al.
Published: (2024)