Bibliographic Details
Main Authors: Wang, Zezhou, Li, Youjie, Lin, Zhiqi, Yang, Jiacheng, Xie, Cong, Feng, Guanyu, Zhong, Zheng, Huang, Ziyue, Zhu, Hongyu, Zhang, Zhi, Peng, Yanghua, Liu, Xin
Format: Preprint
Published: 2026
Subjects: Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
Online Access: https://arxiv.org/abs/2602.22437
_version_ 1866913052997189632
author Wang, Zezhou
Li, Youjie
Lin, Zhiqi
Yang, Jiacheng
Xie, Cong
Feng, Guanyu
Zhong, Zheng
Huang, Ziyue
Zhu, Hongyu
Zhang, Zhi
Peng, Yanghua
Liu, Xin
contents Fully Sharded Data Parallel (FSDP), also known as Zero Redundancy Optimizer (ZeRO), is widely used for large-scale model training, because of its memory efficiency and minimal intrusion on model code. However, existing FSDP systems rely on fixed element-wise or row-wise sharding formats that conflict with block-structured computations. As a result, they struggle to support modern structure-aware training methods, including block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. In addition, today's implementations incur communication and memory overheads that degrade efficiency at the scale of tens of thousands of GPUs. We introduce veScale-FSDP, a novel FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. veScale-FSDP enables zero-copy FSDP communications and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
format Preprint
id arxiv_https___arxiv_org_abs_2602_22437
institution arXiv
publishDate 2026
record_format arxiv
title veScale-FSDP: Flexible and High-Performance FSDP at Scale
topic Distributed, Parallel, and Cluster Computing
Artificial Intelligence
Machine Learning
url https://arxiv.org/abs/2602.22437
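
Illustrative note on the abstract: the claim that fixed element-wise sharding conflicts with block-structured computations can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not veScale-FSDP's RaggedShard or its planning algorithm, whose details this record does not give. The function names, the block size of 128, and the 3-rank setup are hypothetical and chosen only for illustration.

```python
import math

def even_shard_bounds(numel, world_size):
    """Element-wise sharding: split a flattened tensor into equal contiguous chunks."""
    per_rank = math.ceil(numel / world_size)
    return [(min(r * per_rank, numel), min((r + 1) * per_rank, numel))
            for r in range(world_size)]

def block_aligned_bounds(numel, world_size, block):
    """Hypothetical alternative: hand out whole blocks, so shard sizes may differ."""
    num_blocks = math.ceil(numel / block)
    base, extra = divmod(num_blocks, world_size)
    bounds, start_block = [], 0
    for r in range(world_size):
        n = base + (1 if r < extra else 0)  # first `extra` ranks get one more block
        start = min(start_block * block, numel)
        end = min((start_block + n) * block, numel)
        bounds.append((start, end))
        start_block += n
    return bounds

def split_blocks(bounds, block):
    """Count shard boundaries that fall inside a block instead of on its edge."""
    total = bounds[-1][1]
    cuts = {end for _, end in bounds[:-1]}
    return sum(1 for c in cuts if c < total and c % block != 0)

even = even_shard_bounds(1000, 3)
aligned = block_aligned_bounds(1000, 3, 128)
print(even, split_blocks(even, 128))
print(aligned, split_blocks(aligned, 128))
```

With 1000 elements on 3 ranks, the equal split cuts two 128-element blocks across ranks, so a block-wise quantizer or a block-structured optimizer would need extra communication or copies to see a whole block. The block-aligned split cuts none, at the price of uneven shard sizes.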