Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Novikov, Georgii, Oseledets, Ivan
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2407.15545
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866912059716796416
author	Novikov, Georgii Oseledets, Ivan
author_facet	Novikov, Georgii Oseledets, Ivan
contents	The scaling of neural networks with increasing data and model sizes necessitates the development of more efficient deep learning algorithms. A significant challenge in neural network training is the memory footprint associated with activation tensors, particularly in pointwise nonlinearity layers that traditionally save the entire input tensor for the backward pass, leading to substantial memory consumption. In this paper, we propose a modification to the handling of activation tensors in pointwise nonlinearity layers. Our method involves saving the output tensor instead of the input tensor during the forward pass. Since the subsequent layer typically also saves its input tensor, this approach reduces the total memory required by storing only one tensor between layers instead of two. This optimization is especially beneficial for transformer-based architectures like GPT, BERT, Mistral, and Llama. To enable this approach, we utilize the inverse function of the nonlinearity during the backward pass. As the inverse cannot be computed analytically for most nonlinearities, we construct accurate approximations using simpler functions. Experimental results demonstrate that our method significantly reduces memory usage without affecting training accuracy or computational performance. Our implementation is provided as a drop-in replacement for standard nonlinearity layers in the PyTorch framework, facilitating easy adoption without requiring architectural modifications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_15545
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Inverted Activations: Reducing Memory Footprint in Neural Network Training Novikov, Georgii Oseledets, Ivan Machine Learning The scaling of neural networks with increasing data and model sizes necessitates the development of more efficient deep learning algorithms. A significant challenge in neural network training is the memory footprint associated with activation tensors, particularly in pointwise nonlinearity layers that traditionally save the entire input tensor for the backward pass, leading to substantial memory consumption. In this paper, we propose a modification to the handling of activation tensors in pointwise nonlinearity layers. Our method involves saving the output tensor instead of the input tensor during the forward pass. Since the subsequent layer typically also saves its input tensor, this approach reduces the total memory required by storing only one tensor between layers instead of two. This optimization is especially beneficial for transformer-based architectures like GPT, BERT, Mistral, and Llama. To enable this approach, we utilize the inverse function of the nonlinearity during the backward pass. As the inverse cannot be computed analytically for most nonlinearities, we construct accurate approximations using simpler functions. Experimental results demonstrate that our method significantly reduces memory usage without affecting training accuracy or computational performance. Our implementation is provided as a drop-in replacement for standard nonlinearity layers in the PyTorch framework, facilitating easy adoption without requiring architectural modifications.
title	Inverted Activations: Reducing Memory Footprint in Neural Network Training
topic	Machine Learning
url	https://arxiv.org/abs/2407.15545

Similar Items