Bibliographic Details
Main Authors: Rani, Nanda; Milner, Kimberly; Shao, Minghao; Udeshi, Meet; Xi, Haoran; Putrevu, Venkata Sai Charan; Aggarwal, Saksham; Shukla, Sandeep K.; Krishnamurthy, Prashanth; Khorrami, Farshad; Shafique, Muhammad; Karri, Ramesh
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2602.08023
Table of Contents:
  • Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce CTFExplorer, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.