Bibliographic Details
Main Authors: Liu, Zeyu; Li, Yan; Zhang, Yunquan; Zhang, Boyang; Jiang, Guoyong; Zhang, Xin; Xiao, Limin; Zhang, Weifeng; Cheng, Daning
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2506.12037
Abstract:
  • Training large language models typically demands extensive GPU memory and substantial financial investment, which poses a barrier for many small- to medium-sized teams. In this paper, we propose a full-parameter pre-training and fine-tuning framework based on block coordinate descent (BCD), enhanced with engineering optimizations, to enable efficient training of large-scale models on cost-effective RTX 4090, A100, and A800 GPU clusters. Under identical hardware configurations, the framework reduces the training cost of a 7B model to 33% of the cost of standard full-parameter training on A100/A800 and to only 2.6% on RTX 4090. It also enables large models previously restricted to A100 clusters to be trained on RTX 4090 clusters without degrading performance. In most cases, BCD achieves accuracy comparable to or better than full-parameter training and fine-tuning methods, with lower GPU consumption and improved hardware utilization.
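The abstract names block coordinate descent without describing it. As a rough illustration of the general idea only (not the authors' framework or its engineering optimizations), the sketch below applies cyclic BCD to a small least-squares problem: at each step, one block of parameters is updated by a gradient step while every other block stays frozen. The matrix `A`, vector `b`, block partition, learning rate, and all function names are hypothetical choices made for this example.

```python
# Toy cyclic block coordinate descent (BCD) on 0.5 * ||A w - b||^2.
# Illustrative assumption-laden sketch; not the paper's training code.

def residual(A, b, w):
    # r = A w - b
    return [sum(a * x for a, x in zip(row, w)) - bi for row, bi in zip(A, b)]

def loss(A, b, w):
    return 0.5 * sum(r * r for r in residual(A, b, w))

def block_step(A, b, w, block, lr):
    # Gradient step on the indices in `block` only; other entries are frozen.
    r = residual(A, b, w)
    new_w = list(w)
    for j in block:
        grad_j = sum(A[i][j] * r[i] for i in range(len(A)))  # (A^T r)_j
        new_w[j] = w[j] - lr * grad_j
    return new_w

def bcd(A, b, w, blocks, lr=0.02, sweeps=200):
    # Cycle through the blocks, updating one block at a time.
    for _ in range(sweeps):
        for block in blocks:
            w = block_step(A, b, w, block, lr)
    return w

# Hypothetical problem data and block partition.
A = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 3.0]]
b = [1.0, 2.0, 3.0, 4.0]
w0 = [0.0, 0.0, 0.0, 0.0]
blocks = [[0, 1], [2, 3]]

w_final = bcd(A, b, w0, blocks)
print(loss(A, b, w0), loss(A, b, w_final))
```

In a neural-network setting the blocks would plausibly correspond to groups of layers, so that gradients and optimizer state are only needed for the active block at any moment; that mapping is an assumption here, and the record itself does not say how the paper partitions parameters.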