:: Library Catalog

Bibliographic Details
Main Authors:	Cui, Fan; Hou, Hongyuan; Luo, Zizhang; Yin, Chenyun; Liang, Yun
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.14709

Similar Items

R3A: Reliable RTL Repair Framework with Multi-Agent Fault Localization and Stochastic Tree-of-Thoughts Patch Generation
by: Luo, Zizhang, et al.
Published: (2025)

Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair
by: Luo, Zizhang, et al.
Published: (2026)

LLM-Guided Strategy Synthesis for Scalable Equality Saturation
by: Yin, Chenyun, et al.
Published: (2026)

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
by: Long, Xiang, et al.
Published: (2026)

PerfBench: Can Agents Resolve Real-World Performance Bugs?
by: Garg, Spandan, et al.
Published: (2025)

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
by: Luo, Zizhang, et al.
Published: (2026)

DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows
by: Liu, Zhou, et al.
Published: (2025)

SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents
by: Mündler, Niels, et al.
Published: (2024)

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
by: He, Wei, et al.
Published: (2025)

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
by: Chu, Zhaoyang, et al.
Published: (2026)

MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use
by: Liu, Wenrui, et al.
Published: (2025)

SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
by: Yin, Sheng, et al.
Published: (2024)

DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis
by: Zhang, Qiaohong, et al.
Published: (2026)

GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging
by: Ni, Ziyi, et al.
Published: (2025)

BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software
by: Zhang, Zehua, et al.
Published: (2025)

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
by: Liu, Ruoqi, et al.
Published: (2026)

FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
by: Zhu, Jie, et al.
Published: (2026)

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
by: Liu, Zhiqiang, et al.
Published: (2026)

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
by: Lin, Edward, et al.
Published: (2026)

HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
by: Liu, Xuan, et al.
Published: (2026)

NEWSAGENT: Benchmarking Multimodal Agents as Journalists with Real-World Newswriting Tasks
by: Chien, Yen-Che, et al.
Published: (2025)

$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking
by: Yu, Peijie, et al.
Published: (2025)

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
by: Hu, Lingxiang, et al.
Published: (2026)

FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use
by: Lu, Jiaxuan, et al.
Published: (2026)

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
by: Liu, Yibing, et al.
Published: (2026)

DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
by: Gao, Zeyu, et al.
Published: (2025)

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
by: Song, Zhiheng, et al.
Published: (2026)

LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
by: Li, Hao, et al.
Published: (2026)

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
by: Li, Keyu, et al.
Published: (2026)

TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
by: Shen, Yuanzhe, et al.
Published: (2026)

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
by: Chi, Yizhe, et al.
Published: (2026)

CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments
by: Yu, Yi, et al.
Published: (2026)

Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair
by: Feng, Qiong, et al.
Published: (2024)

WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
by: Yen, Thomson, et al.
Published: (2026)

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services
by: He, Hang, et al.
Published: (2025)

DeliveryBench: Can Agents Earn Profit in Real World?
by: Mao, Lingjun, et al.
Published: (2025)

SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories
by: Shen, Chihao, et al.
Published: (2025)

ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents
by: Shen, Haiyang, et al.
Published: (2024)

ComBench: A Repo-level Real-world Benchmark for Compilation Error Repair
by: Li, Jia, et al.
Published: (2026)

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
by: Li, Kuan, et al.
Published: (2026)