Bibliographic Details
Main Authors: Ma, Liqun; Sun, Mingjie; Shen, Zhiqiang
Format: Preprint
Published: 2024
Subjects: Computation and Language; Artificial Intelligence; Machine Learning
Online Access: https://arxiv.org/abs/2407.07093
_version_ 1866914863345827840
author Ma, Liqun
Sun, Mingjie
Shen, Zhiqiang
contents This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (as opposed to partially binarized or ternary LLMs such as BitNet b1.58) to match the performance of its full-precision counterparts (e.g., FP16 or BF16) in transformer-based LLMs. It achieves this by employing an autoregressive distillation (AD) loss while maintaining the same model dimensions (130M, 1.3B, 7B) and training data volume as regular LLM pretraining, and it delivers competitive results in terms of perplexity and task-specific effectiveness. Intriguingly, by analyzing the training trajectory, we find that pretrained weights are not necessary for training binarized LLMs from scratch. This research encourages a new computational framework and may facilitate the future design of specialized hardware tailored for fully 1-bit LLMs. We make all models, code, and the training dataset fully accessible and transparent to support further research (Code: https://github.com/LiqunMa/FBI-LLM. Model: https://huggingface.co/LiqunMa/).
format Preprint
id arxiv_https___arxiv_org_abs_2407_07093
institution arXiv
publishDate 2024
record_format arxiv
title FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
topic Computation and Language
Artificial Intelligence
Machine Learning
url https://arxiv.org/abs/2407.07093