Bibliographic Details
Main authors: Yu, Tao; Gupta, Gaurav; Gopalswamy, Karthick; Mamidala, Amith; Zhou, Hao; Huynh, Jeffrey; Park, Youngsuk; Diamant, Ron; Deoras, Anoop; Huan, Luke
Format: Preprint
Published: 2024
Subjects: Machine Learning
Online access: https://arxiv.org/abs/2405.03637
_version_ 1866913342842470400
author Yu, Tao
Gupta, Gaurav
Gopalswamy, Karthick
Mamidala, Amith
Zhou, Hao
Huynh, Jeffrey
Park, Youngsuk
Diamant, Ron
Deoras, Anoop
Huan, Luke
contents Training large models is plagued by intense compute cost and limited hardware memory. A practical solution is low-precision representation, but it suffers from loss of numerical accuracy and unstable training, which render the model less useful. We argue that low-precision floating-point numbers can perform well provided the error is properly compensated at the critical locations in the training process. We propose Collage, which uses a multi-component float representation in low precision to perform operations accurately while accounting for numerical errors. To understand the impact of imprecision on training, we propose a simple and novel metric that tracks the information lost during training and differentiates between precision strategies. Our method works with commonly used low-precision formats such as half precision ($16$-bit floating point) and extends naturally to even lower precision such as $8$-bit. Experimental results show that pre-training with Collage removes the need for $32$-bit floating-point copies of the model and attains similar or better training performance compared to the $(16, 32)$-bit mixed-precision strategy, with up to $3.7\times$ speedup and $\sim 15\%$ to $23\%$ less memory usage in practice.
format Preprint
id arxiv_https___arxiv_org_abs_2405_03637
institution arXiv
publishDate 2024
record_format arxiv
title Collage: Light-Weight Low-Precision Strategy for LLM Training
topic Machine Learning
url https://arxiv.org/abs/2405.03637
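
The multi-component float idea mentioned in the abstract can be illustrated with a small, generic sketch. This is not the Collage implementation: it assumes IEEE 754 half precision emulated through Python's `struct` format `e`, and it uses Knuth's TwoSum, a standard error-free transformation, as a stand-in for whatever compensation the paper applies. The names `fp16`, `two_sum`, `mcf_add`, and `demo` are hypothetical and chosen only for this example. A weight is stored as a pair `(hi, lo)` of 16-bit values, where `lo` holds the rounding error that a single 16-bit value would discard.

```python
import struct
from typing import Tuple


def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def two_sum(a: float, b: float) -> Tuple[float, float]:
    """Knuth's TwoSum on fp16 inputs: returns (s, err) with s + err == a + b."""
    s = fp16(a + b)
    b_virtual = fp16(s - a)
    a_virtual = fp16(s - b_virtual)
    err = fp16(fp16(a - a_virtual) + fp16(b - b_virtual))
    return s, err


def mcf_add(hi: float, lo: float, u: float) -> Tuple[float, float]:
    """Add an fp16 update u to a two-component value (hi, lo)."""
    s, e = two_sum(hi, u)   # e is the part of u that did not fit into hi
    lo = fp16(lo + e)       # accumulate the lost part in the second component
    return two_sum(s, lo)   # renormalize so that lo stays small relative to hi


def demo(steps: int = 1000, update: float = 1e-4):
    """Apply the same tiny update many times, with and without compensation."""
    u = fp16(update)
    plain = fp16(1.0)            # single fp16 weight
    hi, lo = fp16(1.0), 0.0      # two-component fp16 weight
    for _ in range(steps):
        plain = fp16(plain + u)  # u is below half an ulp of 1.0, so it is lost
        hi, lo = mcf_add(hi, lo, u)
    reference = 1.0 + steps * u  # float64 reference value
    return plain, hi, lo, reference


if __name__ == "__main__":
    plain, hi, lo, reference = demo()
    print("plain fp16:      ", plain)   # stays at exactly 1.0
    print("two-component:   ", hi + lo)
    print("float64 reference:", reference)
```

In this toy setting the single 16-bit weight never moves, because each update is smaller than half the spacing between representable values near 1.0, while the two-component weight stays close to the float64 reference. How Collage chooses where to apply such compensation during training is described in the paper itself, not in this sketch.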