:: Library Catalog

Bibliographic Details
Main Author:	Ndzomga, Franck
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.23749

Similar Items

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)

FORTIS: Benchmarking Over-Privilege in Agent Skills
by: Li, Shawn, et al.
Published: (2026)

MDGYM: Benchmarking AI Agents on Molecular Simulations
by: Kumar, Vinay, et al.
Published: (2026)

Anticipatory Planning for Multimodal AI Agents
by: Liang, Yongyuan, et al.
Published: (2026)

Harnessing Language for Coordination: A Framework and Benchmark for LLM-Driven Multi-Agent Control
by: Anne, Timothée, et al.
Published: (2024)

OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
by: Davydova, Mariya, et al.
Published: (2025)

ART: Action-based Reasoning Task Benchmarking for Medical AI Agents
by: Mantravadi, Ananya, et al.
Published: (2026)

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
by: Li, Miles Q., et al.
Published: (2026)

From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent
by: Wang, Yuhang, et al.
Published: (2026)

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
by: Xiong, Lei, et al.
Published: (2026)

MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents
by: Wei, Ziming, et al.
Published: (2025)

A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
by: Li, Miles Q., et al.
Published: (2025)

EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots
by: Lei, Zixing, et al.
Published: (2026)

PBT-Bench: Benchmarking AI Agents on Property-Based Testing
by: Jing, Lucas, et al.
Published: (2026)

AgentSearchBench: A Benchmark for AI Agent Search in the Wild
by: Wu, Bin, et al.
Published: (2026)

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
by: Clark, Jackson, et al.
Published: (2026)

MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
by: Wang, Luyuan, et al.
Published: (2024)

ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities
by: Zanoli, Christopher, et al.
Published: (2026)

ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines
by: Jin, Tengjun, et al.
Published: (2025)

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
by: Li, Yubo, et al.
Published: (2026)

Improvisational Games as a Benchmark for Social Intelligence of AI Agents: The Case of Connections
by: Parikh, Gaurav Rajesh, et al.
Published: (2026)

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
by: Bragg, Jonathan, et al.
Published: (2025)

FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
by: May, Victor, et al.
Published: (2025)

ClawArena: Benchmarking AI Agents in Evolving Information Environments
by: Ji, Haonian, et al.
Published: (2026)

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
by: Wang, Jianghui, et al.
Published: (2025)

AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems
by: Ferrag, Mohamed Amine, et al.
Published: (2026)

Benchmarking Agents in Insurance Underwriting Environments
by: Dsouza, Amanda, et al.
Published: (2026)

ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development
by: Lu, Pengrui, et al.
Published: (2026)

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
by: He, Muyu, et al.
Published: (2026)

BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science
by: Lin, Xinna, et al.
Published: (2024)

MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
by: Fan, Shiqing, et al.
Published: (2025)

NetArena: Dynamic Benchmarks for AI Agents in Network Automation
by: Zhou, Yajie, et al.
Published: (2025)

FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents
by: Kim, Eric Y., et al.
Published: (2026)

MLGym: A New Framework and Benchmark for Advancing AI Research Agents
by: Nathani, Deepak, et al.
Published: (2025)

Governing AI Agents
by: Kolt, Noam
Published: (2025)

Infrastructure for AI Agents
by: Chan, Alan, et al.
Published: (2025)

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
by: Dong, Haonan, et al.
Published: (2026)

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
by: Jiang, Tanqiu, et al.
Published: (2026)

CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
by: Xu, Tianqi, et al.
Published: (2024)

ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution
by: Goswami, Kanika, et al.
Published: (2025)