Bibliographic Details
Main Authors: Debnath, Rubi; Mateu, Daniel Bujosa; Zhao, Luxi; Craciunas, Silviu S.; Pop, Paul; Steinhorst, Sebastian
Format: Preprint
Published: 2026
Subjects: Networking and Internet Architecture
Online Access: https://arxiv.org/abs/2605.09481
author Debnath, Rubi
Mateu, Daniel Bujosa
Zhao, Luxi
Craciunas, Silviu S.
Pop, Paul
Steinhorst, Sebastian
contents We present TSNBench, the first benchmark for evaluating large language model (LLM) proficiency in Time-Sensitive Networking (TSN), a suite of IEEE 802.1 standards for deterministic communication with bounded latency in safety-critical domains such as autonomous vehicles, aviation, defense, and industrial automation. While LLMs have been extensively evaluated on general knowledge tasks, their capabilities in safety-critical networking domains remain largely unexplored. TSNBench comprises 939 expert-validated multiple-choice questions (MCQs) covering diverse TSN mechanisms, along with 100 open-ended Worst-Case Delay (WCD) computation tasks for Credit-Based Shaper (CBS) and Cyclic Queuing and Forwarding (CQF) across varying network topologies and traffic conditions. MCQ answers are validated by domain experts. Ground-truth WCD values for the open-ended tasks are computed with a verified Network Calculus (NC) solver for CBS and with closed-form mathematical upper bounds for CQF. We evaluate 16 LLMs and find that although models achieve 67% to 95% accuracy on MCQs, they fail substantially on open-ended WCD computation. For CBS, only GPT-5 achieves a Mean Absolute Percentage Error (MAPE) as low as 36.2%, meaning its predicted WCD deviates on average by 36.2% of the ground-truth WCD of the TSN flow, while most models exceed 80%. For CQF, the best model achieves 41.8% MAPE, with most models clustering between 80% and 100%. Such errors are large relative to TSN latency budgets and can lead to violations of real-time constraints and unsafe configurations. TSNBench demonstrates that MCQ benchmarks may overestimate LLM capabilities in safety-critical networking domains.
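The MAPE figures quoted in the abstract can be read with the standard definition of the metric: the mean of |predicted - actual| / |actual| over all tasks, expressed in percent. The sketch below is a minimal illustration of that definition, not the authors' evaluation code. The function name `mape` and the sample delay values are assumptions made for illustration and do not come from the paper.

```python
def mape(predicted, actual):
    """Mean Absolute Percentage Error, in percent.

    Standard textbook definition; whether TSNBench aggregates
    exactly this way is an assumption.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("ground-truth values must be non-zero")
    total = sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)


# Hypothetical worst-case delays in microseconds (illustrative only).
ground_truth_wcd = [100.0, 200.0, 400.0]
predicted_wcd = [150.0, 100.0, 400.0]

# Per-task relative errors are 0.5, 0.5 and 0.0, so the mean is 1/3.
score = mape(predicted_wcd, ground_truth_wcd)
print(f"MAPE: {score:.1f}%")
```

Under this definition, a MAPE of 36.2% means that a predicted delay is off by roughly a third of the reference value on average, in either direction.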
format Preprint
id arxiv_https___arxiv_org_abs_2605_09481
institution arXiv
publishDate 2026
record_format arxiv
title TSNBench: Benchmarking LLM Proficiency in Time-Sensitive Networking
topic Networking and Internet Architecture
url https://arxiv.org/abs/2605.09481