Bibliographic Details
Main authors: Chen, Bo; Cheng, Xingyi; Li, Pan; Geng, Yangli-ao; Gong, Jing; Li, Shen; Bei, Zhilei; Tan, Xu; Wang, Boyan; Zeng, Xin; Liu, Chiming; Zeng, Aohan; Dong, Yuxiao; Tang, Jie; Song, Le
Format: Preprint
Published: 2024
Subjects: Quantitative Methods, Artificial Intelligence, Machine Learning
Online access: https://arxiv.org/abs/2401.06199
_version_ 1866915053898301440
author Chen, Bo
Cheng, Xingyi
Li, Pan
Geng, Yangli-ao
Gong, Jing
Li, Shen
Bei, Zhilei
Tan, Xu
Wang, Boyan
Zeng, Xin
Liu, Chiming
Zeng, Aohan
Dong, Yuxiao
Tang, Jie
Song, Le
contents Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, so they struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility of the two types of objectives and their potential for joint optimization, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal two findings. 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM can not only generate de novo protein sequences following the principles of natural ones, but also perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.
format Preprint
id arxiv_https___arxiv_org_abs_2401_06199
institution arXiv
publishDate 2024
record_format arxiv
title xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
topic Quantitative Methods
Artificial Intelligence
Machine Learning
url https://arxiv.org/abs/2401.06199