| Main Authors: | Twabi, Ahmed; Ding, Yepeng; Kondo, Tohru |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
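| Subjects: | Networking and Internet Architecture; Artificial Intelligence; Formal Languages and Automata Theory |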
| Online Access: | https://arxiv.org/abs/2604.09678 |
| _version_ | 1866915931416952832 |
|---|---|
| author | Twabi, Ahmed; Ding, Yepeng; Kondo, Tohru |
| author_facet | Twabi, Ahmed; Ding, Yepeng; Kondo, Tohru |
| contents | As agentic network management gains popularity, there is a critical need for evaluation frameworks that transcend static, one-shot testing. To address this, we introduce NetAgentBench, a dynamic benchmark that evaluates agent interactions through a Finite State Machine (FSM) formalization guaranteeing determinism, correctness, and bounded execution. This provides the networking landscape with a rigorous foundation to measure complex, multi-turn operational behaviors. Our empirical evaluation of four state-of-the-art LLM agents through diverse network configuration tasks reveals stark deficiencies: while agents can solve basic tasks, they suffer severe exploration meltdowns and coherence collapse during expert-level configurations. Ultimately, NetAgentBench demonstrates that systematically evaluating multi-turn behavioral stability is an indispensable step toward realizing trustworthy, fully autonomous networks. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_09678 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | NetAgentBench: A State-Centric Benchmark for Evaluating Agentic Network Configuration |
| topic | Networking and Internet Architecture; Artificial Intelligence; Formal Languages and Automata Theory |
| url | https://arxiv.org/abs/2604.09678 |