Staff view: How Does Critical Batch Size Scale in Pre-training? :: Library Catalog


Bibliographic details
Main authors:	Zhang, Hanlin; Morwani, Depen; Vyas, Nikhil; Wu, Jingfeng; Zou, Difan; Ghai, Udaya; Foster, Dean; Kakade, Sham
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence; Optimization and Control
Online access:	https://arxiv.org/abs/2410.21676

_version_	1866917991424196608
author	Zhang, Hanlin; Morwani, Depen; Vyas, Nikhil; Wu, Jingfeng; Zou, Difan; Ghai, Udaya; Foster, Dean; Kakade, Sham
author_facet	Zhang, Hanlin; Morwani, Depen; Vyas, Nikhil; Wu, Jingfeng; Zou, Difan; Ghai, Udaya; Foster, Dean; Kakade, Sham
contents	Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size (CBS), concerning the compromise between time and compute, marks the threshold beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyper-parameter sweeps and careful control of factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Then we fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify theoretically through the analysis of infinite-width limits of neural networks and infinite-dimensional least squares regression. Of independent interest, we highlight the importance of common hyper-parameter choices and strategies for studying large-scale pre-training beyond fixed training durations.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_21676
institution	arXiv
publishDate	2024
record_format	arxiv
title	How Does Critical Batch Size Scale in Pre-training?
topic	Machine Learning; Artificial Intelligence; Optimization and Control
url	https://arxiv.org/abs/2410.21676
