Bibliographic Details
Main authors:	Wang, Zezhou; Li, Youjie; Lin, Zhiqi; Yang, Jiacheng; Xie, Cong; Feng, Guanyu; Zhong, Zheng; Huang, Ziyue; Zhu, Hongyu; Zhang, Zhi; Peng, Yanghua; Liu, Xin
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
Online access:	https://arxiv.org/abs/2602.22437

_version_	1866913052997189632
author	Wang, Zezhou; Li, Youjie; Lin, Zhiqi; Yang, Jiacheng; Xie, Cong; Feng, Guanyu; Zhong, Zheng; Huang, Ziyue; Zhu, Hongyu; Zhang, Zhi; Peng, Yanghua; Liu, Xin
author_facet	Wang, Zezhou; Li, Youjie; Lin, Zhiqi; Yang, Jiacheng; Xie, Cong; Feng, Guanyu; Zhong, Zheng; Huang, Ziyue; Zhu, Hongyu; Zhang, Zhi; Peng, Yanghua; Liu, Xin
contents	Fully Sharded Data Parallel (FSDP), also known as Zero Redundancy Optimizer (ZeRO), is widely used for large-scale model training, because of its memory efficiency and minimal intrusion on model code. However, existing FSDP systems rely on fixed element-wise or row-wise sharding formats that conflict with block-structured computations. As a result, they struggle to support modern structure-aware training methods, including block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. In addition, today's implementations incur communication and memory overheads that degrade efficiency at the scale of tens of thousands of GPUs. We introduce veScale-FSDP, a novel FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. veScale-FSDP enables zero-copy FSDP communications and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_22437
institution	arXiv
publishDate	2026
record_format	arxiv
title	veScale-FSDP: Flexible and High-Performance FSDP at Scale
topic	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2602.22437
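
The contents field above says that fixed element-wise sharding conflicts with block-structured computations such as block-wise quantization. The following is a minimal illustrative sketch of that conflict, not the paper's RaggedShard format or its planning algorithm: it compares an even element-wise split of a flattened tensor with an uneven split whose boundaries fall only on block boundaries. All function names, the block size of 128, and the tensor size are assumptions chosen for illustration.

```python
def flat_shards(numel, world_size):
    """Even element-wise split of a flattened tensor: one [start, end) range per rank."""
    per_rank = -(-numel // world_size)  # ceiling division
    return [
        (min(r * per_rank, numel), min((r + 1) * per_rank, numel))
        for r in range(world_size)
    ]


def block_aligned_shards(numel, world_size, block):
    """Uneven split whose shard boundaries fall only on multiples of `block`."""
    num_blocks = -(-numel // block)
    base, extra = divmod(num_blocks, world_size)
    bounds, start = [], 0
    for r in range(world_size):
        # The first `extra` ranks take one additional block each.
        blocks = base + (1 if r < extra else 0)
        end = min(start + blocks * block, numel)
        bounds.append((start, end))
        start = end
    return bounds


def count_split_blocks(bounds, block, numel):
    """Count interior shard boundaries that cut through the middle of a block."""
    cuts = {end for _, end in bounds if 0 < end < numel}
    return sum(1 for c in cuts if c % block != 0)


if __name__ == "__main__":
    numel, world_size, block = 1280, 4, 128  # hypothetical: 10 blocks over 4 ranks
    flat = flat_shards(numel, world_size)
    aligned = block_aligned_shards(numel, world_size, block)
    print("flat:   ", flat, "split blocks:", count_split_blocks(flat, block, numel))
    print("aligned:", aligned, "split blocks:", count_split_blocks(aligned, block, numel))
```

In this toy setting the even split cuts two of the ten blocks across ranks, so a block-wise operation on those blocks would need data from two ranks, while the block-aligned split keeps every block on a single rank at the cost of unequal shard sizes.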

Similar Items