Staff View: TSNBench: Benchmarking LLM Proficiency in Time-Sensitive Networking :: Library Catalog

Bibliographic Details
Main Authors:	Debnath, Rubi; Mateu, Daniel Bujosa; Zhao, Luxi; Craciunas, Silviu S.; Pop, Paul; Steinhorst, Sebastian
Format:	Preprint
Published:	2026
Subjects:	Networking and Internet Architecture
Online Access:	https://arxiv.org/abs/2605.09481

_version_	1866917478125273088
author	Debnath, Rubi; Mateu, Daniel Bujosa; Zhao, Luxi; Craciunas, Silviu S.; Pop, Paul; Steinhorst, Sebastian
author_facet	Debnath, Rubi; Mateu, Daniel Bujosa; Zhao, Luxi; Craciunas, Silviu S.; Pop, Paul; Steinhorst, Sebastian
contents	We present TSNBench, the first benchmark for evaluating large language model (LLM) proficiency in Time-Sensitive Networking (TSN), a suite of IEEE 802.1 standards for deterministic communication with bounded latency in safety-critical domains such as autonomous vehicles, aviation, defense, and industrial automation. While LLMs have been extensively evaluated on general knowledge tasks, their capabilities in safety-critical networking domains remain largely unexplored. TSNBench comprises 939 expert-validated multiple-choice questions (MCQs) covering diverse TSN mechanisms, along with 100 open-ended Worst-Case Delay (WCD) computation tasks for Credit-Based Shaper (CBS) and Cyclic Queuing and Forwarding (CQF) across varying network topologies and traffic conditions. MCQ answers are validated by domain experts, and open-ended ground truth WCD values are computed using a verified Network Calculus (NC) solver for CBS and closed-form mathematical upper bounds for CQF. We evaluate 16 LLMs and find that although models achieve 67 to 95% accuracy on MCQs, they fail substantially on open-ended WCD computation. For CBS, only GPT-5 achieves a Mean Absolute Percentage Error (MAPE) of 36.2%, meaning its predicted WCD deviates by 36.2% of the actual TSN flow delay on average, while most models exceed 80%. For CQF, the best model achieves 41.8% MAPE, with most models clustering between 80% and 100%. Such errors are large relative to TSN latency budgets and can lead to violations of real-time constraints and unsafe configurations. TSNBench demonstrates that MCQ benchmarks may overestimate LLM capabilities in safety-critical networking domains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_09481
institution	arXiv
publishDate	2026
record_format	arxiv
title	TSNBench: Benchmarking LLM Proficiency in Time-Sensitive Networking
topic	Networking and Internet Architecture
url	https://arxiv.org/abs/2605.09481

Similar Items