Bibliographic Details
Main Authors: Hu, Ming, Ma, Chenglong, Li, Wei, Xu, Wanghan, Wu, Jiamin, Hu, Jucheng, Li, Tianbin, Zhuang, Guohang, Liu, Jiaqi, Lu, Yingzhou, Chen, Ying, Zhang, Chaoyang, Tan, Cheng, Ying, Jie, Wu, Guocheng, Gao, Shujian, Chen, Pengcheng, Lin, Jiashi, Wu, Haitao, Chen, Lulu, Wang, Fengxiang, Zhang, Yuanyuan, Zhao, Xiangyu, Tang, Feilong, Su, Encheng, Ning, Junzhi, Liu, Xinyao, Du, Ye, Ji, Changkai, Jiang, Pengfei, Tang, Cheng, Huang, Ziyan, Liu, Jiyao, Wei, Jiaqi, Yang, Yuejin, Zhang, Xiang, Wang, Guangshuai, Yang, Yue, Xu, Huihui, Chen, Ziyang, Wang, Yizhou, Tang, Chen, Wu, Jianyu, Ren, Yuchen, Yan, Siyuan, Wang, Zhonghua, Xu, Zhongxing, Su, Shiyan, Sun, Shangquan, Zhao, Runkai, Zhang, Zhisheng, Yang, Dingkang, Wei, Jinjie, Wang, Jiaqi, Xu, Jiahao, Yan, Jiangtao, Tang, Wenhao, Zhu, Hongze, Liu, Yu, Wang, Fudi, Shen, Yiqing, Ji, Yuanfeng, Su, Yanzhou, Xie, Tong, Shan, Hongming, Feng, Chun-Mei, Hou, Zhi, Song, Diping, Liu, Lihao, Huang, Yanyan, Yu, Lequan, Fu, Bin, Wang, Shujun, Li, Xiaomeng, Hu, Xiaowei, Gu, Yun, Fei, Ben, Wang, Benyou, Cao, Yuewen, Shen, Minjie, Xu, Jie, Duan, Haodong, Yan, Fang, Hao, Hongxia, Li, Jielan, Du, Jiajun, Wang, Yanbo, Razzak, Imran, Deng, Zhongying, Zhang, Chi, Wu, Lijun, He, Conghui, Lu, Zhaohui, Huang, Jinhai, Shao, Wenqi, Liu, Yihao, Luo, Siqi, Xin, Yi, Liu, Xiaohong, Ling, Fenghua, Li, Yuqiang, Wang, Aoran, Sun, Siqi, Zheng, Qihao, Dong, Nanqing, Fu, Tianfan, Zhou, Dongzhan, Lu, Yan, Zhang, Wenlong, Ye, Jin, Cai, Jianfei, Chen, Yirong, Ouyang, Wanli, Qiao, Yu, Ge, Zongyuan, Tang, Shixiang, He, Junjun, Song, Chunfeng, Bai, Lei, Zhou, Bowen
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2508.21148
Table of Contents:
  • Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.