Saved in:
Bibliographic Details
Main Authors: Novikov, Georgii, Oseledets, Ivan
Format: Preprint
Published: 2024
Subjects:
Online Access:https://arxiv.org/abs/2407.15545
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866912059716796416
author Novikov, Georgii
Oseledets, Ivan
author_facet Novikov, Georgii
Oseledets, Ivan
contents The scaling of neural networks with increasing data and model sizes necessitates the development of more efficient deep learning algorithms. A significant challenge in neural network training is the memory footprint associated with activation tensors, particularly in pointwise nonlinearity layers that traditionally save the entire input tensor for the backward pass, leading to substantial memory consumption. In this paper, we propose a modification to the handling of activation tensors in pointwise nonlinearity layers. Our method involves saving the output tensor instead of the input tensor during the forward pass. Since the subsequent layer typically also saves its input tensor, this approach reduces the total memory required by storing only one tensor between layers instead of two. This optimization is especially beneficial for transformer-based architectures like GPT, BERT, Mistral, and Llama. To enable this approach, we utilize the inverse function of the nonlinearity during the backward pass. As the inverse cannot be computed analytically for most nonlinearities, we construct accurate approximations using simpler functions. Experimental results demonstrate that our method significantly reduces memory usage without affecting training accuracy or computational performance. Our implementation is provided as a drop-in replacement for standard nonlinearity layers in the PyTorch framework, facilitating easy adoption without requiring architectural modifications.
format Preprint
id arxiv_https___arxiv_org_abs_2407_15545
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Inverted Activations: Reducing Memory Footprint in Neural Network Training
Novikov, Georgii
Oseledets, Ivan
Machine Learning
The scaling of neural networks with increasing data and model sizes necessitates the development of more efficient deep learning algorithms. A significant challenge in neural network training is the memory footprint associated with activation tensors, particularly in pointwise nonlinearity layers that traditionally save the entire input tensor for the backward pass, leading to substantial memory consumption. In this paper, we propose a modification to the handling of activation tensors in pointwise nonlinearity layers. Our method involves saving the output tensor instead of the input tensor during the forward pass. Since the subsequent layer typically also saves its input tensor, this approach reduces the total memory required by storing only one tensor between layers instead of two. This optimization is especially beneficial for transformer-based architectures like GPT, BERT, Mistral, and Llama. To enable this approach, we utilize the inverse function of the nonlinearity during the backward pass. As the inverse cannot be computed analytically for most nonlinearities, we construct accurate approximations using simpler functions. Experimental results demonstrate that our method significantly reduces memory usage without affecting training accuracy or computational performance. Our implementation is provided as a drop-in replacement for standard nonlinearity layers in the PyTorch framework, facilitating easy adoption without requiring architectural modifications.
title Inverted Activations: Reducing Memory Footprint in Neural Network Training
topic Machine Learning
url https://arxiv.org/abs/2407.15545