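| Main Authors: | Hu, Ming; Ma, Chenglong; Li, Wei; Xu, Wanghan; Wu, Jiamin; Hu, Jucheng; Li, Tianbin; Zhuang, Guohang; Liu, Jiaqi; Lu, Yingzhou; Chen, Ying; Zhang, Chaoyang; Tan, Cheng; Ying, Jie; Wu, Guocheng; Gao, Shujian; Chen, Pengcheng; Lin, Jiashi; Wu, Haitao; Chen, Lulu; Wang, Fengxiang; Zhang, Yuanyuan; Zhao, Xiangyu; Tang, Feilong; Su, Encheng; Ning, Junzhi; Liu, Xinyao; Du, Ye; Ji, Changkai; Jiang, Pengfei; Tang, Cheng; Huang, Ziyan; Liu, Jiyao; Wei, Jiaqi; Yang, Yuejin; Zhang, Xiang; Wang, Guangshuai; Yang, Yue; Xu, Huihui; Chen, Ziyang; Wang, Yizhou; Tang, Chen; Wu, Jianyu; Ren, Yuchen; Yan, Siyuan; Wang, Zhonghua; Xu, Zhongxing; Su, Shiyan; Sun, Shangquan; Zhao, Runkai; Zhang, Zhisheng; Yang, Dingkang; Wei, Jinjie; Wang, Jiaqi; Xu, Jiahao; Yan, Jiangtao; Tang, Wenhao; Zhu, Hongze; Liu, Yu; Wang, Fudi; Shen, Yiqing; Ji, Yuanfeng; Su, Yanzhou; Xie, Tong; Shan, Hongming; Feng, Chun-Mei; Hou, Zhi; Song, Diping; Liu, Lihao; Huang, Yanyan; Yu, Lequan; Fu, Bin; Wang, Shujun; Li, Xiaomeng; Hu, Xiaowei; Gu, Yun; Fei, Ben; Wang, Benyou; Cao, Yuewen; Shen, Minjie; Xu, Jie; Duan, Haodong; Yan, Fang; Hao, Hongxia; Li, Jielan; Du, Jiajun; Wang, Yanbo; Razzak, Imran; Deng, Zhongying; Zhang, Chi; Wu, Lijun; He, Conghui; Lu, Zhaohui; Huang, Jinhai; Shao, Wenqi; Liu, Yihao; Luo, Siqi; Xin, Yi; Liu, Xiaohong; Ling, Fenghua; Li, Yuqiang; Wang, Aoran; Sun, Siqi; Zheng, Qihao; Dong, Nanqing; Fu, Tianfan; Zhou, Dongzhan; Lu, Yan; Zhang, Wenlong; Ye, Jin; Cai, Jianfei; Chen, Yirong; Ouyang, Wanli; Qiao, Yu; Ge, Zongyuan; Tang, Shixiang; He, Junjun; Song, Chunfeng; Bai, Lei; Zhou, Bowen |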
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computation and Language; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2508.21148 |
| _version_ | 1866912656497049600 |
|---|---|
| author | Hu, Ming; Ma, Chenglong; Li, Wei; Xu, Wanghan; Wu, Jiamin; Hu, Jucheng; Li, Tianbin; Zhuang, Guohang; Liu, Jiaqi; Lu, Yingzhou; Chen, Ying; Zhang, Chaoyang; Tan, Cheng; Ying, Jie; Wu, Guocheng; Gao, Shujian; Chen, Pengcheng; Lin, Jiashi; Wu, Haitao; Chen, Lulu; Wang, Fengxiang; Zhang, Yuanyuan; Zhao, Xiangyu; Tang, Feilong; Su, Encheng; Ning, Junzhi; Liu, Xinyao; Du, Ye; Ji, Changkai; Jiang, Pengfei; Tang, Cheng; Huang, Ziyan; Liu, Jiyao; Wei, Jiaqi; Yang, Yuejin; Zhang, Xiang; Wang, Guangshuai; Yang, Yue; Xu, Huihui; Chen, Ziyang; Wang, Yizhou; Tang, Chen; Wu, Jianyu; Ren, Yuchen; Yan, Siyuan; Wang, Zhonghua; Xu, Zhongxing; Su, Shiyan; Sun, Shangquan; Zhao, Runkai; Zhang, Zhisheng; Yang, Dingkang; Wei, Jinjie; Wang, Jiaqi; Xu, Jiahao; Yan, Jiangtao; Tang, Wenhao; Zhu, Hongze; Liu, Yu; Wang, Fudi; Shen, Yiqing; Ji, Yuanfeng; Su, Yanzhou; Xie, Tong; Shan, Hongming; Feng, Chun-Mei; Hou, Zhi; Song, Diping; Liu, Lihao; Huang, Yanyan; Yu, Lequan; Fu, Bin; Wang, Shujun; Li, Xiaomeng; Hu, Xiaowei; Gu, Yun; Fei, Ben; Wang, Benyou; Cao, Yuewen; Shen, Minjie; Xu, Jie; Duan, Haodong; Yan, Fang; Hao, Hongxia; Li, Jielan; Du, Jiajun; Wang, Yanbo; Razzak, Imran; Deng, Zhongying; Zhang, Chi; Wu, Lijun; He, Conghui; Lu, Zhaohui; Huang, Jinhai; Shao, Wenqi; Liu, Yihao; Luo, Siqi; Xin, Yi; Liu, Xiaohong; Ling, Fenghua; Li, Yuqiang; Wang, Aoran; Sun, Siqi; Zheng, Qihao; Dong, Nanqing; Fu, Tianfan; Zhou, Dongzhan; Lu, Yan; Zhang, Wenlong; Ye, Jin; Cai, Jianfei; Chen, Yirong; Ouyang, Wanli; Qiao, Yu; Ge, Zongyuan; Tang, Shixiang; He, Junjun; Song, Chunfeng; Bai, Lei; Zhou, Bowen |
| contents | Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2508_21148 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers |
| topic | Computation and Language; Artificial Intelligence |
| url | https://arxiv.org/abs/2508.21148 |