Bibliographic Details
Main Authors: Cui, Fan; Hou, Hongyuan; Luo, Zizhang; Yin, Chenyun; Liang, Yun
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2604.14709
author Cui, Fan
Hou, Hongyuan
Luo, Zizhang
Yin, Chenyun
Liang, Yun
contents Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC-level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug-type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware-semantic reasoning, and cross-artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware-aware agents.
format Preprint
id arxiv_https___arxiv_org_abs_2604_14709
institution arXiv
publishDate 2026
record_format arxiv
title HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
topic Artificial Intelligence
url https://arxiv.org/abs/2604.14709