Staff View: BigBang-Proton Technical Report: Next-Word-Prediction is Scientific Multitask Learner :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Hengkui; Liu, Liujiang; He, Jihua; Wang, Qihao; Zhao, Keke; Hu, Shuyang; Fu, Renle; Liang, Dahao; Zeng, Lingyu; Liu, Bruce; Liu, Yuan; Zhan, Jin; Niu, Jiaqiang; Jia, Xinglong; Hu, Yaqin; Ji, Wenjun; Chi, Panpan; Chen, Ken; Wu, Hengyuan; Xin, Yingsi; Zhu, Yongfeng; Wang, Yuexin; Ruan, Manqi; Bian, Ningtao; Wu, Xiaohua; Xu, Weipeng
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Materials Science; Artificial Intelligence; Computational Physics; MSC classes: 68T05, 68T50, 00A69, 94A99; ACM classes: I.2.6; I.2.7; J.2; I.6.3; K.4.1
Online Access:	https://arxiv.org/abs/2510.00129

_version_	1866916981633974272
author	Wu, Hengkui; Liu, Liujiang; He, Jihua; Wang, Qihao; Zhao, Keke; Hu, Shuyang; Fu, Renle; Liang, Dahao; Zeng, Lingyu; Liu, Bruce; Liu, Yuan; Zhan, Jin; Niu, Jiaqiang; Jia, Xinglong; Hu, Yaqin; Ji, Wenjun; Chi, Panpan; Chen, Ken; Wu, Hengyuan; Xin, Yingsi; Zhu, Yongfeng; Wang, Yuexin; Ruan, Manqi; Bian, Ningtao; Wu, Xiaohua; Xu, Weipeng
author_facet	Wu, Hengkui; Liu, Liujiang; He, Jihua; Wang, Qihao; Zhao, Keke; Hu, Shuyang; Fu, Renle; Liang, Dahao; Zeng, Lingyu; Liu, Bruce; Liu, Yuan; Zhan, Jin; Niu, Jiaqiang; Jia, Xinglong; Hu, Yaqin; Ji, Wenjun; Chi, Panpan; Chen, Ken; Wu, Hengyuan; Xin, Yingsi; Zhu, Yongfeng; Wang, Yuexin; Ruan, Manqi; Bian, Ningtao; Wu, Xiaohua; Xu, Weipeng
contents	We introduce BigBang-Proton, a unified sequence-based architecture for auto-regressive language modeling pretrained on cross-scale, cross-structure, cross-discipline real-world scientific tasks to construct a scientific multi-task learner. BigBang-Proton incorporates three fundamental innovations compared to mainstream general-purpose LLMs: the Theory-Experiment Learning paradigm aligns large-scale numerical experimental data with theoretical text corpora; Binary Patch Encoding replaces byte pair encoding (BPE) tokenization; Monte Carlo Attention substitutes for traditional transformer architectures. Through next-word-prediction pretraining on cross-discipline scientific datasets of real-world problems mixed with a general text corpus, followed by fine-tuning and inference on downstream tasks, BigBang-Proton demonstrates 100% accuracy in arithmetic addition of up to 50 digits, performance on par with leading specialized models in particle physics jet tagging, MAE matching that of specialized models in inter-atomic potential simulation, performance comparable to traditional spatiotemporal models in water quality prediction, and benchmark-exceeding performance in genome modeling. These results prove that language-guided scientific computing can match or exceed the performance of task-specific scientific models while maintaining multitask learning capabilities. We further hypothesize that scaling the pretraining to the universe scale is a fundamental step toward developing a material-world foundational model.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_00129
institution	arXiv
publishDate	2025
record_format	arxiv
title	BigBang-Proton Technical Report: Next-Word-Prediction is Scientific Multitask Learner
topic	Machine Learning; Materials Science; Artificial Intelligence; Computational Physics; MSC classes: 68T05, 68T50, 00A69, 94A99; ACM classes: I.2.6; I.2.7; J.2; I.6.3; K.4.1
url	https://arxiv.org/abs/2510.00129

Similar Items