Bibliographic Details
Main Authors: Cheng, Xingyi, Chen, Bo, Li, Pan, Gong, Jing, Tang, Jie, Song, Le
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2411.02142
_version_ 1866910683548876800
author Cheng, Xingyi
Chen, Bo
Li, Pan
Gong, Jing
Tang, Jie
Song, Le
author_facet Cheng, Xingyi
Chen, Bo
Li, Pan
Gong, Jing
Tang, Jie
Song, Le
contents We explore optimally training protein language models, an area of significant interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute resources until performance gains plateau, focusing primarily on increasing model size rather than optimizing the efficient compute frontier that balances performance and compute budget. Our investigation is grounded in a massive dataset consisting of 939 million protein sequences. We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens to investigate the relations between model size, number of training tokens, and training objective. First, we observed diminishing returns for the Causal Language Model (CLM) and overfitting for the Masked Language Model (MLM) when repeating the commonly used UniRef database. To address this, we included metagenomic protein sequences in the training set to increase diversity and avoid the plateau or overfitting effects. Second, we obtained the scaling laws of CLM and MLM on Transformer, tailored to the specific characteristics of protein sequence data. Third, we observed a transfer scaling phenomenon from CLM to MLM, further demonstrating the effectiveness of transfer through scaling behaviors based on estimated Effectively Transferred Tokens. Finally, to validate our scaling laws, we compare the large-scale versions of ESM-2 and PROGEN2 on downstream tasks, encompassing evaluations of protein generation as well as structure- and function-related tasks, all within smaller or equivalent pre-training compute budgets.
format Preprint
id arxiv_https___arxiv_org_abs_2411_02142
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Training Compute-Optimal Protein Language Models
Cheng, Xingyi
Chen, Bo
Li, Pan
Gong, Jing
Tang, Jie
Song, Le
Machine Learning
Artificial Intelligence
Quantitative Methods
title Training Compute-Optimal Protein Language Models
topic Machine Learning
Artificial Intelligence
Quantitative Methods
url https://arxiv.org/abs/2411.02142