Bibliographic Details
Main Authors:	Alman, Josh; Song, Zhao
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computational Complexity; Computation and Language; Data Structures and Algorithms
Online Access:	https://arxiv.org/abs/2402.04497

Summary:

Large language models (LLMs) have made fundamental contributions over the last few years. To train an LLM, one needs to alternately run "forward" computations and "backward" computations. The forward computation can be viewed as attention function evaluation, and the backward computation can be viewed as a gradient computation. In previous work by [Alman and Song, NeurIPS 2023], it was proved that the forward step can be performed in almost-linear time in certain parameter regimes, but that there is no truly sub-quadratic time algorithm in the remaining parameter regimes unless the popular hypothesis SETH is false. In this work, we show nearly identical results for the harder-seeming problem of computing the gradient of the loss function of a one-layer attention network, and thus for the entire process of LLM training. This completely characterizes the fine-grained complexity of every step of LLM training.
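For orientation, the forward and backward steps mentioned in the summary can be sketched as the naive quadratic-time baseline below. This is only an illustration, not the almost-linear algorithm of the paper. The function names (`attention_probs`, `forward`, `loss`, `backward_Q`), the scaling of the scores by `d`, the squared-error loss against a target `E`, and the choice to differentiate with respect to `Q` are all assumptions made for this example; the summary itself does not specify them.

```python
import math

def matmul(A, B):
    """Plain product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def attention_probs(Q, K):
    """Row-normalized exp(Q K^T / d): the explicit n x n matrix (quadratic cost)."""
    d = len(Q[0])
    P = []
    for row in matmul(Q, transpose(K)):
        m = max(row)  # shift by the row max for numerical stability
        e = [math.exp((s - m) / d) for s in row]
        z = sum(e)
        P.append([x / z for x in e])
    return P

def forward(Q, K, V):
    """Forward step: attention output P V."""
    return matmul(attention_probs(Q, K), V)

def loss(Q, K, V, E):
    """Assumed loss: half the squared Frobenius distance to a target E."""
    out = forward(Q, K, V)
    return 0.5 * sum((o - e) ** 2 for ro, re in zip(out, E) for o, e in zip(ro, re))

def backward_Q(Q, K, V, E):
    """Backward step: gradient of the assumed loss with respect to Q."""
    d = len(Q[0])
    P = attention_probs(Q, K)
    out = matmul(P, V)
    R = [[o - e for o, e in zip(ro, re)] for ro, re in zip(out, E)]  # residual
    G = matmul(R, transpose(V))  # dL/dP, an n x n matrix
    dS = []
    for p_row, g_row in zip(P, G):
        dot = sum(p * g for p, g in zip(p_row, g_row))
        # softmax backward per row, including the 1/d score scaling
        dS.append([p * (g - dot) / d for p, g in zip(p_row, g_row)])
    return matmul(dS, K)
```

Both `forward` and `backward_Q` materialize an n x n matrix, which is where the quadratic cost in the sequence length n comes from in this sketch.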