| Main Author: | Ndzomga, Franck |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.23749 |
Similar Items
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)
FORTIS: Benchmarking Over-Privilege in Agent Skills
by: Li, Shawn, et al.
Published: (2026)
MDGYM: Benchmarking AI Agents on Molecular Simulations
by: Kumar, Vinay, et al.
Published: (2026)
Anticipatory Planning for Multimodal AI Agents
by: Liang, Yongyuan, et al.
Published: (2026)
Harnessing Language for Coordination: A Framework and Benchmark for LLM-Driven Multi-Agent Control
by: Anne, Timothée, et al.
Published: (2024)
OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
by: Davydova, Mariya, et al.
Published: (2025)
ART: Action-based Reasoning Task Benchmarking for Medical AI Agents
by: Mantravadi, Ananya, et al.
Published: (2026)
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
by: Li, Miles Q., et al.
Published: (2026)
From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent
by: Wang, Yuhang, et al.
Published: (2026)
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
by: Xiong, Lei, et al.
Published: (2026)
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents
by: Wei, Ziming, et al.
Published: (2025)
A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
by: Li, Miles Q., et al.
Published: (2025)
EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots
by: Lei, Zixing, et al.
Published: (2026)
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
by: Jing, Lucas, et al.
Published: (2026)
AgentSearchBench: A Benchmark for AI Agent Search in the Wild
by: Wu, Bin, et al.
Published: (2026)
SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
by: Clark, Jackson, et al.
Published: (2026)
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
by: Wang, Luyuan, et al.
Published: (2024)
ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities
by: Zanoli, Christopher, et al.
Published: (2026)
ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines
by: Jin, Tengjun, et al.
Published: (2025)
PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
by: Li, Yubo, et al.
Published: (2026)
Improvisational Games as a Benchmark for Social Intelligence of AI Agents: The Case of Connections
by: Parikh, Gaurav Rajesh, et al.
Published: (2026)
AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite
by: Bragg, Jonathan, et al.
Published: (2025)
FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
by: May, Victor, et al.
Published: (2025)
ClawArena: Benchmarking AI Agents in Evolving Information Environments
by: Ji, Haonian, et al.
Published: (2026)
Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
by: Wang, Jianghui, et al.
Published: (2025)
AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems
by: Ferrag, Mohamed Amine, et al.
Published: (2026)
Benchmarking Agents in Insurance Underwriting Environments
by: Dsouza, Amanda, et al.
Published: (2026)
ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development
by: Lu, Pengrui, et al.
Published: (2026)
YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
by: He, Muyu, et al.
Published: (2026)
BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science
by: Lin, Xinna, et al.
Published: (2024)
MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
by: Fan, Shiqing, et al.
Published: (2025)
NetArena: Dynamic Benchmarks for AI Agents in Network Automation
by: Zhou, Yajie, et al.
Published: (2025)
FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents
by: Kim, Eric Y., et al.
Published: (2026)
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
by: Nathani, Deepak, et al.
Published: (2025)
Governing AI Agents
by: Kolt, Noam
Published: (2025)
Infrastructure for AI Agents
by: Chan, Alan, et al.
Published: (2025)
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
by: Dong, Haonan, et al.
Published: (2026)
AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
by: Jiang, Tanqiu, et al.
Published: (2026)
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
by: Xu, Tianqi, et al.
Published: (2024)
ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution
by: Goswami, Kanika, et al.
Published: (2025)