Bibliographic Details
Main Authors: Twabi, Ahmed; Ding, Yepeng; Kondo, Tohru
Format: Preprint
Published: 2026
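Subjects: Networking and Internet Architecture; Artificial Intelligence; Formal Languages and Automata Theory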
Online Access: https://arxiv.org/abs/2604.09678
author Twabi, Ahmed
Ding, Yepeng
Kondo, Tohru
contents As agentic network management gains popularity, there is a critical need for evaluation frameworks that transcend static, one-shot testing. To address this, we introduce NetAgentBench, a dynamic benchmark that evaluates agent interactions through a Finite State Machine (FSM) formalization guaranteeing determinism, correctness, and bounded execution. This provides the networking landscape with a rigorous foundation to measure complex, multi-turn operational behaviors. Our empirical evaluation of four state-of-the-art LLM agents through diverse network configuration tasks reveals stark deficiencies: while agents can solve basic tasks, they suffer severe exploration meltdowns and coherence collapse during expert-level configurations. Ultimately, NetAgentBench demonstrates that systematically evaluating multi-turn behavioral stability is an indispensable step toward realizing trustworthy, fully autonomous networks.
format Preprint
id arxiv_https___arxiv_org_abs_2604_09678
institution arXiv
publishDate 2026
record_format arxiv
title NetAgentBench: A State-Centric Benchmark for Evaluating Agentic Network Configuration
topic Networking and Internet Architecture
Artificial Intelligence
Formal Languages and Automata Theory
url https://arxiv.org/abs/2604.09678