Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Liu, James, Xiao, Guangxuan, Li, Kai, Lee, Jason D., Han, Song, Dao, Tri, Cai, Tianle
Format:	Preprint
Published:	2024
Subjects:	Machine Learning Computation and Language
Online Access:	https://arxiv.org/abs/2402.10193
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866916435829194752
author	Liu, James Xiao, Guangxuan Li, Kai Lee, Jason D. Han, Song Dao, Tri Cai, Tianle
author_facet	Liu, James Xiao, Guangxuan Li, Kai Lee, Jason D. Han, Song Dao, Tri Cai, Tianle
contents	Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_10193
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	BitDelta: Your Fine-Tune May Only Be Worth One Bit Liu, James Xiao, Guangxuan Li, Kai Lee, Jason D. Han, Song Dao, Tri Cai, Tianle Machine Learning Computation and Language Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.
title	BitDelta: Your Fine-Tune May Only Be Worth One Bit
topic	Machine Learning Computation and Language
url	https://arxiv.org/abs/2402.10193

Similar Items