| Main Authors: | Shepard, Daniel; Salimans, Robin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.18934 |
Similar Items
Multistep Distillation of Diffusion Models via Moment Matching
by: Salimans, Tim, et al.
Published: (2024)
Automating Document Intelligence in Statutory City Planning
by: Malmqvist, Lars, et al.
Published: (2026)
Adopting Large Language Models to Automated System Integration
by: Pesl, Robin D.
Published: (2025)
MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring
by: Jotautaitė, Monika, et al.
Published: (2026)
TaskBench: Benchmarking Large Language Models for Task Automation
by: Shen, Yongliang, et al.
Published: (2023)
Bench4KE: Benchmarking Automated Competency Question Generation
by: Lippolis, Anna Sofia, et al.
Published: (2025)
LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
by: Yan, Zhiling, et al.
Published: (2026)
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
by: Wang, Zhensheng, et al.
Published: (2026)
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
by: Fu, Deqing, et al.
Published: (2024)
DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
by: Patel, Liana, et al.
Published: (2025)
EM Distillation for One-step Diffusion Models
by: Xie, Sirui, et al.
Published: (2024)
FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)
by: Priyanshu, Aman, et al.
Published: (2024)
Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions
by: Moore, Steven, et al.
Published: (2024)
SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation
by: Oliva, Gustavo A., et al.
Published: (2025)
AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance
by: Patel, Dhaval, et al.
Published: (2025)
PostTrainBench: Can LLM Agents Automate LLM Post-Training?
by: Rank, Ben, et al.
Published: (2026)
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
by: Tu, Xinming, et al.
Published: (2026)
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering
by: Li, Yinsheng, et al.
Published: (2025)
Counterfactual Reasoning in Automated Planning
by: Pozanco, Alberto, et al.
Published: (2026)
RamanBench: A Large-Scale Benchmark for Machine Learning on Raman Spectroscopy
by: Koddenbrock, Mario, et al.
Published: (2026)
A Graph-Attentive LSTM Model for Malicious URL Detection
by: Hossain, Md. Ifthekhar, et al.
Published: (2025)
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
by: Chen, Haolin, et al.
Published: (2026)
ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines
by: Jin, Tengjun, et al.
Published: (2025)
Deep Research Bench: Evaluating AI Web Research Agents
by: FutureSearch, et al.
Published: (2025)
MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models
by: Kim, Hyunjun, et al.
Published: (2025)
TaoBench: Do Automated Theorem Prover LLMs Generalize Beyond MathLib?
by: Taylor, Alexander K, et al.
Published: (2026)
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
by: Kim, Yubin, et al.
Published: (2026)
Framing AI System Benchmarking as a Learning Task: FlexBench and the Open MLPerf Dataset
by: Fursin, Grigori, et al.
Published: (2025)
WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
by: Yen, Thomson, et al.
Published: (2026)
REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning
by: Taechoyotin, Pawin, et al.
Published: (2025)
LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
by: Fein, Daniel, et al.
Published: (2025)
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation
by: Zhao, Enyu, et al.
Published: (2025)
Riemann-Bench: A Benchmark for Moonshot Mathematics
by: Garre, Suhaas, et al.
Published: (2026)
JobBench: Aligning Agent Work With Human Will
by: Li, Yuetai, et al.
Published: (2026)
Sales Research Agent and Sales Research Bench
by: Bhol, Deepanjan
Published: (2025)
AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies
by: Ye, Xiao, et al.
Published: (2024)
Automating Thought of Search: A Journey Towards Soundness and Completeness
by: Cao, Daniel, et al.
Published: (2024)
ReEfBench: Quantifying the Reasoning Efficiency of LLMs
by: Fu, Zhizhang, et al.
Published: (2026)
FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
by: Wang, Zhen, et al.
Published: (2026)
ConvexBench: Can LLMs Recognize Convex Functions?
by: Liu, Yepeng, et al.
Published: (2026)