Bibliographic Details
Main Authors: Khashab, Sajy, Alcoz, Albert Gran, Gal, Alon, Romano, Jacky, Abboud, Rani, Piasetzky, Yonatan, Maman, Lior, Nishry, Amit, Gafni, Barak, Shabtai, Omer, Kadosh, Matty, Goldenberg, Dror, Shainer, Gilad, Silberstein, Mark
Format: Preprint
Published: 2026
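Subjects: Networking and Internet Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
Online Access: https://arxiv.org/abs/2605.21187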
author Khashab, Sajy
Alcoz, Albert Gran
Gal, Alon
Romano, Jacky
Abboud, Rani
Piasetzky, Yonatan
Maman, Lior
Nishry, Amit
Gafni, Barak
Shabtai, Omer
Kadosh, Matty
Goldenberg, Dror
Shainer, Gilad
Silberstein, Mark
contents As distributed model training scales to span hundreds of thousands of GPUs, scale-out networks face unprecedented performance and efficiency demands. NVIDIA Spectrum-X Ethernet has been designed from the ground up to achieve predictable and stable network performance with high utilization and low latency. This paper presents the Spectrum-X multiplane architecture, which replaces hierarchical depth with topological parallelism, and introduces hardware-accelerated load balancing in NICs and switches as the key architectural approach to reacting quickly to highly dynamic network conditions at the microsecond timescales that AI training workloads demand. We describe the motivation, design principles, evaluation methodology, and performance on state-of-the-art benchmarks, as well as the lessons we learned from deploying and debugging Spectrum-X networks in large-scale systems. Our evaluation highlights production-grade AI infrastructure performance across three core dimensions: 98% of the theoretical line rate with low, jitter-free latency; strong cross-tenant isolation for concurrent workloads; and robust, capacity-proportional bisection bandwidth with a 7% latency increase under 10% fabric link failures, together with rapid reaction to host and fabric link flaps during LLM training workloads.
format Preprint
id arxiv_https___arxiv_org_abs_2605_21187
institution arXiv
publishDate 2026
record_format arxiv
title High-speed Networking for Giga-Scale AI Factories
topic Networking and Internet Architecture
Artificial Intelligence
Distributed, Parallel, and Cluster Computing
url https://arxiv.org/abs/2605.21187