Bibliographic Details
Main Authors:	Rani, Nanda; Milner, Kimberly; Shao, Minghao; Udeshi, Meet; Xi, Haoran; Putrevu, Venkata Sai Charan; Aggarwal, Saksham; Shukla, Sandeep K.; Krishnamurthy, Prashanth; Khorrami, Farshad; Shafique, Muhammad; Karri, Ramesh
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Artificial Intelligence; Multiagent Systems
Online Access:	https://arxiv.org/abs/2602.08023

Table of Contents:

Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce CTFExplorer, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.

Similar Items