Staff View: ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention :: Library Catalog

Bibliographic Details
Main Authors:	Chou, Yuhong; Liu, Zehao; Zhu, Ruijie; Wan, Xinyi; Li, Tianjian; Chu, Congying; Liu, Qian; Wu, Jibin; Ma, Zejun
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2507.01004

_version_	1866909671449690112
author	Chou, Yuhong; Liu, Zehao; Zhu, Ruijie; Wan, Xinyi; Li, Tianjian; Chu, Congying; Liu, Qian; Wu, Jibin; Ma, Zejun
author_facet	Chou, Yuhong; Liu, Zehao; Zhu, Ruijie; Wan, Xinyi; Li, Tianjian; Chu, Congying; Liu, Qian; Wu, Jibin; Ma, Zejun
contents	Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, essential for distributing these workloads across devices, become the primary bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve end-to-end near-linear scalability for long sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with a 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new collective communication primitive. All-Scan provides each SP rank with precisely the initial operator state it requires while maintaining a minimal communication footprint, effectively eliminating communication overhead. Theoretically, we prove the optimality of ZeCO, showing that it introduces only negligible time and space overhead. Empirically, we compare the communication costs of different sequence parallelism strategies and demonstrate that All-Scan achieves the fastest communication in SP scenarios. Specifically, on 256 GPUs with an 8M sequence length, ZeCO achieves a 60% speedup compared to the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a clear path toward efficiently training next-generation LLMs on previously intractable sequence lengths.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_01004
institution	arXiv
publishDate	2025
record_format	arxiv
title	ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention
topic	Machine Learning
url	https://arxiv.org/abs/2507.01004

Similar Items