Staff View: Training Compute-Optimal Protein Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Cheng, Xingyi; Chen, Bo; Li, Pan; Gong, Jing; Tang, Jie; Song, Le
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence; Quantitative Methods
Online Access:	https://arxiv.org/abs/2411.02142

_version_	1866910683548876800
author	Cheng, Xingyi; Chen, Bo; Li, Pan; Gong, Jing; Tang, Jie; Song, Le
author_facet	Cheng, Xingyi; Chen, Bo; Li, Pan; Gong, Jing; Tang, Jie; Song, Le
contents	We explore optimally training protein language models, an area of significant interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute resources until performance gains plateau, focusing primarily on increasing model sizes rather than optimizing the efficient compute frontier that balances performance and compute budgets. Our investigation is grounded in a massive dataset consisting of 939 million protein sequences. We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens, to investigate the relations between model sizes, training token numbers, and objectives. First, we observed the effect of diminishing returns for the Causal Language Model (CLM) and that of overfitting for the Masked Language Model (MLM) when repeating the commonly used UniRef database. To address this, we included metagenomic protein sequences in the training set to increase the diversity and avoid the plateau or overfitting effects. Second, we obtained the scaling laws of CLM and MLM on Transformer, tailored to the specific characteristics of protein sequence data. Third, we observed a transfer scaling phenomenon from CLM to MLM, further demonstrating the effectiveness of transfer through scaling behaviors based on estimated Effectively Transferred Tokens. Finally, to validate our scaling laws, we compare the large-scale versions of ESM-2 and PROGEN2 on downstream tasks, encompassing evaluations of protein generation as well as structure- and function-related tasks, all within smaller or equivalent pre-training compute budgets.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_02142
institution	arXiv
publishDate	2024
record_format	arxiv
title	Training Compute-Optimal Protein Language Models
topic	Machine Learning; Artificial Intelligence; Quantitative Methods
url	https://arxiv.org/abs/2411.02142

Similar Items