:: Library Catalog

Bibliographic Details
Main Authors:	Shepard, Daniel; Salimans, Robin
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.18934

Similar Items

Multistep Distillation of Diffusion Models via Moment Matching
by: Salimans, Tim, et al.
Published: (2024)

Automating Document Intelligence in Statutory City Planning
by: Malmqvist, Lars, et al.
Published: (2026)

Adopting Large Language Models to Automated System Integration
by: Pesl, Robin D.
Published: (2025)

MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring
by: Jotautaitė, Monika, et al.
Published: (2026)

TaskBench: Benchmarking Large Language Models for Task Automation
by: Shen, Yongliang, et al.
Published: (2023)

Bench4KE: Benchmarking Automated Competency Question Generation
by: Lippolis, Anna Sofia, et al.
Published: (2025)

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
by: Yan, Zhiling, et al.
Published: (2026)

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
by: Wang, Zhensheng, et al.
Published: (2026)

IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
by: Fu, Deqing, et al.
Published: (2024)

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
by: Patel, Liana, et al.
Published: (2025)

EM Distillation for One-step Diffusion Models
by: Xie, Sirui, et al.
Published: (2024)

FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)
by: Priyanshu, Aman, et al.
Published: (2024)

Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions
by: Moore, Steven, et al.
Published: (2024)

SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation
by: Oliva, Gustavo A., et al.
Published: (2025)

AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance
by: Patel, Dhaval, et al.
Published: (2025)

PostTrainBench: Can LLM Agents Automate LLM Post-Training?
by: Rank, Ben, et al.
Published: (2026)

BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
by: Tu, Xinming, et al.
Published: (2026)

DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering
by: Li, Yinsheng, et al.
Published: (2025)

Counterfactual Reasoning in Automated Planning
by: Pozanco, Alberto, et al.
Published: (2026)

RamanBench: A Large-Scale Benchmark for Machine Learning on Raman Spectroscopy
by: Koddenbrock, Mario, et al.
Published: (2026)

A Graph-Attentive LSTM Model for Malicious URL Detection
by: Hossain, Md. Ifthekhar, et al.
Published: (2025)

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
by: Chen, Haolin, et al.
Published: (2026)

ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines
by: Jin, Tengjun, et al.
Published: (2025)

Deep Research Bench: Evaluating AI Web Research Agents
by: FutureSearch, et al.
Published: (2025)

MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models
by: Kim, Hyunjun, et al.
Published: (2025)

TaoBench: Do Automated Theorem Prover LLMs Generalize Beyond MathLib?
by: Taylor, Alexander K, et al.
Published: (2026)

TeamBench: Evaluating Agent Coordination under Enforced Role Separation
by: Kim, Yubin, et al.
Published: (2026)

Framing AI System Benchmarking as a Learning Task: FlexBench and the Open MLPerf Dataset
by: Fursin, Grigori, et al.
Published: (2025)

WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
by: Yen, Thomson, et al.
Published: (2026)

REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning
by: Taechoyotin, Pawin, et al.
Published: (2025)

LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
by: Fein, Daniel, et al.
Published: (2025)

ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation
by: Zhao, Enyu, et al.
Published: (2025)

Riemann-Bench: A Benchmark for Moonshot Mathematics
by: Garre, Suhaas, et al.
Published: (2026)

JobBench: Aligning Agent Work With Human Will
by: Li, Yuetai, et al.
Published: (2026)

Sales Research Agent and Sales Research Bench
by: Bhol, Deepanjan
Published: (2025)

AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies
by: Ye, Xiao, et al.
Published: (2024)

Automating Thought of Search: A Journey Towards Soundness and Completeness
by: Cao, Daniel, et al.
Published: (2024)

ReEfBench: Quantifying the Reasoning Efficiency of LLMs
by: Fu, Zhizhang, et al.
Published: (2026)

FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
by: Wang, Zhen, et al.
Published: (2026)

ConvexBench: Can LLMs Recognize Convex Functions?
by: Liu, Yepeng, et al.
Published: (2026)