Bibliographic Details
Main Authors: Chen, Ping, Zhang, Wenjie, He, Shuibing, Chen, Weijian, Yang, Siling, Huang, Kexin, Yin, Yanlong, Zhan, Xuan, Gu, Yingjie, Peng, Zhuwei, Zheng, Yi, Wang, Zhefeng, Chen, Gang
Format: Preprint
Published: 2024
Subjects: Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access: https://arxiv.org/abs/2406.08756
_version_ 1866916664289787904
author Chen, Ping
Zhang, Wenjie
He, Shuibing
Chen, Weijian
Yang, Siling
Huang, Kexin
Yin, Yanlong
Zhan, Xuan
Gu, Yingjie
Peng, Zhuwei
Zheng, Yi
Wang, Zhefeng
Chen, Gang
contents Large model training often uses recomputation to alleviate memory pressure and pipelines to exploit the parallelism of data, tensors, and devices. However, existing recomputation approaches may incur high overhead when training real-world models, as they are executed on demand in the critical training path. In this paper, we present Lynx, a new recomputation framework to reduce overhead by overlapping recomputation with communication in training pipelines. To reduce the large search space for recomputation strategies, we propose a heuristic-based recomputation scheduling algorithm, which is based on the observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all such structures. Additionally, we propose a recomputation-aware model partitioning method to balance each stage's execution time for improved training throughput. Our comprehensive evaluation using GPT models with 1.3B-23B parameters shows that Lynx outperforms existing recomputation approaches by up to 1.37x.
format Preprint
id arxiv_https___arxiv_org_abs_2406_08756
institution arXiv
publishDate 2024
record_format arxiv
title Optimizing Large Model Training through Overlapped Activation Recomputation
topic Distributed, Parallel, and Cluster Computing
Machine Learning
url https://arxiv.org/abs/2406.08756