Staff View: High-speed Networking for Giga-Scale AI Factories :: Library Catalog

Bibliographic Details
Main Authors:	Khashab, Sajy; Alcoz, Albert Gran; Gal, Alon; Romano, Jacky; Abboud, Rani; Piasetzky, Yonatan; Maman, Lior; Nishry, Amit; Gafni, Barak; Shabtai, Omer; Kadosh, Matty; Goldenberg, Dror; Shainer, Gilad; Silberstein, Mark
Format:	Preprint
Published:	2026
Subjects:	Networking and Internet Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2605.21187
_version_	1866916035013115904
author	Khashab, Sajy; Alcoz, Albert Gran; Gal, Alon; Romano, Jacky; Abboud, Rani; Piasetzky, Yonatan; Maman, Lior; Nishry, Amit; Gafni, Barak; Shabtai, Omer; Kadosh, Matty; Goldenberg, Dror; Shainer, Gilad; Silberstein, Mark
author_facet	Khashab, Sajy; Alcoz, Albert Gran; Gal, Alon; Romano, Jacky; Abboud, Rani; Piasetzky, Yonatan; Maman, Lior; Nishry, Amit; Gafni, Barak; Shabtai, Omer; Kadosh, Matty; Goldenberg, Dror; Shainer, Gilad; Silberstein, Mark
contents	As distributed model training scales to span hundreds of thousands of GPUs, scale-out networks face unprecedented performance and efficiency demands. NVIDIA Spectrum-X Ethernet has been designed from the ground up to achieve predictable and stable network performance with high utilization and low latency. This paper presents the Spectrum-X multiplane architecture, which replaces hierarchical depth with topological parallelism, and introduces hardware-accelerated load balancing in NICs and switches as the key architectural approach to provide fast reaction to highly dynamic network conditions at the microsecond timescales that AI training workloads demand. We describe the motivation, design principles, evaluation methodology, and performance on state-of-the-art benchmarks, as well as the lessons we learned from deploying and debugging Spectrum-X networks in large-scale systems. Our evaluation highlights production-grade AI infrastructure performance across three core dimensions: 98% of the theoretical line rate with low, jitter-free latency; strong cross-tenant isolation for concurrent workloads; and resilience, with robust, capacity-proportional bisection bandwidth and a 7% latency increase for 10% fabric link failures, together with rapid reaction to host and fabric link flaps during LLM training workloads.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_21187
institution	arXiv
publishDate	2026
record_format	arxiv
title	High-speed Networking for Giga-Scale AI Factories
topic	Networking and Internet Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2605.21187