Staff View: Optimizing Large Model Training through Overlapped Activation Recomputation :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Ping; Zhang, Wenjie; He, Shuibing; Chen, Weijian; Yang, Siling; Huang, Kexin; Yin, Yanlong; Zhan, Xuan; Gu, Yingjie; Peng, Zhuwei; Zheng, Yi; Wang, Zhefeng; Chen, Gang
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access:	https://arxiv.org/abs/2406.08756

_version_	1866916664289787904
author	Chen, Ping; Zhang, Wenjie; He, Shuibing; Chen, Weijian; Yang, Siling; Huang, Kexin; Yin, Yanlong; Zhan, Xuan; Gu, Yingjie; Peng, Zhuwei; Zheng, Yi; Wang, Zhefeng; Chen, Gang
author_facet	Chen, Ping; Zhang, Wenjie; He, Shuibing; Chen, Weijian; Yang, Siling; Huang, Kexin; Yin, Yanlong; Zhan, Xuan; Gu, Yingjie; Peng, Zhuwei; Zheng, Yi; Wang, Zhefeng; Chen, Gang
contents	Large model training often uses recomputation to alleviate memory pressure and pipelines to exploit the parallelism of data, tensors, and devices. However, existing recomputation approaches may incur high overhead when training real-world models, as they are executed on demand in the critical training path. In this paper, we present Lynx, a new recomputation framework to reduce overhead by overlapping recomputation with communication in training pipelines. To reduce the large search space for recomputation strategies, we propose a heuristic-based recomputation scheduling algorithm, which is based on the observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all such structures. Additionally, we propose a recomputation-aware model partitioning method to balance each stage's execution time for improved training throughput. Our comprehensive evaluation using GPT models with 1.3B-23B parameters shows that Lynx outperforms existing recomputation approaches by up to 1.37x.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_08756
institution	arXiv
publishDate	2024
record_format	arxiv
title	Optimizing Large Model Training through Overlapped Activation Recomputation
topic	Distributed, Parallel, and Cluster Computing; Machine Learning
url	https://arxiv.org/abs/2406.08756