Staff View: FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation :: Library Catalog

Bibliographic Details
Main Authors:	Ma, Liqun; Sun, Mingjie; Shen, Zhiqiang
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2407.07093

_version_	1866914863345827840
author	Ma, Liqun; Sun, Mingjie; Shen, Zhiqiang
author_facet	Ma, Liqun; Sun, Mingjie; Shen, Zhiqiang
contents	This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (rather than a partially binary or ternary LLM such as BitNet b1.58) to match the performance of its full-precision counterparts (e.g., FP16 or BF16) in transformer-based LLMs. It achieves this by employing an autoregressive distillation (AD) loss while keeping the model dimensions (130M, 1.3B, 7B) and training data volume the same as in regular LLM pretraining, and it delivers competitive results in perplexity and task-specific effectiveness. Intriguingly, by analyzing the training trajectory, we find that pretrained weights are not necessary for training binarized LLMs from scratch. This research encourages a new computational framework and may facilitate the future design of specialized hardware tailored for fully 1-bit LLMs. We make all models, code, and the training dataset fully accessible and transparent to support further research (Code: https://github.com/LiqunMa/FBI-LLM; Model: https://huggingface.co/LiqunMa/).
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_07093
institution	arXiv
publishDate	2024
record_format	arxiv
title	FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
topic	Computation and Language; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2407.07093
