| Main Authors: | Cui, Fan, Hou, Hongyuan, Luo, Zizhang, Yin, Chenyun, Liang, Yun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.14709 |
Similar Items
R3A: Reliable RTL Repair Framework with Multi-Agent Fault Localization and Stochastic Tree-of-Thoughts Patch Generation
by: Luo, Zizhang, et al.
Published: (2025)
Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair
by: Luo, Zizhang, et al.
Published: (2026)
LLM-Guided Strategy Synthesis for Scalable Equality Saturation
by: Yin, Chenyun, et al.
Published: (2026)
LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
by: Long, Xiang, et al.
Published: (2026)
PerfBench: Can Agents Resolve Real-World Performance Bugs?
by: Garg, Spandan, et al.
Published: (2025)
Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
by: Luo, Zizhang, et al.
Published: (2026)
DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows
by: Liu, Zhou, et al.
Published: (2025)
SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents
by: Mündler, Niels, et al.
Published: (2024)
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
by: He, Wei, et al.
Published: (2025)
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
by: Chu, Zhaoyang, et al.
Published: (2026)
MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use
by: Liu, Wenrui, et al.
Published: (2025)
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
by: Yin, Sheng, et al.
Published: (2024)
DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis
by: Zhang, Qiaohong, et al.
Published: (2026)
GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging
by: Ni, Ziyi, et al.
Published: (2025)
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software
by: Zhang, Zehua, et al.
Published: (2025)
PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
by: Liu, Ruoqi, et al.
Published: (2026)
FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
by: Zhu, Jie, et al.
Published: (2026)
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
by: Liu, Zhiqiang, et al.
Published: (2026)
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
by: Lin, Edward, et al.
Published: (2026)
HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
by: Liu, Xuan, et al.
Published: (2026)
NEWSAGENT: Benchmarking Multimodal Agents as Journalists with Real-World Newswriting Tasks
by: Chien, Yen-Che, et al.
Published: (2025)
$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking
by: Yu, Peijie, et al.
Published: (2025)
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
by: Hu, Lingxiang, et al.
Published: (2026)
FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use
by: Lu, Jiaxuan, et al.
Published: (2026)
OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
by: Liu, Yibing, et al.
Published: (2026)
DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
by: Gao, Zeyu, et al.
Published: (2025)
MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
by: Song, Zhiheng, et al.
Published: (2026)
LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
by: Li, Hao, et al.
Published: (2026)
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
by: Li, Keyu, et al.
Published: (2026)
TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
by: Shen, Yuanzhe, et al.
Published: (2026)
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
by: Chi, Yizhe, et al.
Published: (2026)
CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments
by: Yu, Yi, et al.
Published: (2026)
Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair
by: Feng, Qiong, et al.
Published: (2024)
WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
by: Yen, Thomson, et al.
Published: (2026)
LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services
by: He, Hang, et al.
Published: (2025)
DeliveryBench: Can Agents Earn Profit in Real World?
by: Mao, Lingjun, et al.
Published: (2025)
SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories
by: Shen, Chihao, et al.
Published: (2025)
ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents
by: Shen, Haiyang, et al.
Published: (2024)
ComBench: A Repo-level Real-world Benchmark for Compilation Error Repair
by: Li, Jia, et al.
Published: (2026)
SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
by: Li, Kuan, et al.
Published: (2026)