Bibliographic Details
Main Authors: Hu, Ming, Ma, Chenglong, Li, Wei, Xu, Wanghan, Wu, Jiamin, Hu, Jucheng, Li, Tianbin, Zhuang, Guohang, Liu, Jiaqi, Lu, Yingzhou, Chen, Ying, Zhang, Chaoyang, Tan, Cheng, Ying, Jie, Wu, Guocheng, Gao, Shujian, Chen, Pengcheng, Lin, Jiashi, Wu, Haitao, Chen, Lulu, Wang, Fengxiang, Zhang, Yuanyuan, Zhao, Xiangyu, Tang, Feilong, Su, Encheng, Ning, Junzhi, Liu, Xinyao, Du, Ye, Ji, Changkai, Jiang, Pengfei, Tang, Cheng, Huang, Ziyan, Liu, Jiyao, Wei, Jiaqi, Yang, Yuejin, Zhang, Xiang, Wang, Guangshuai, Yang, Yue, Xu, Huihui, Chen, Ziyang, Wang, Yizhou, Tang, Chen, Wu, Jianyu, Ren, Yuchen, Yan, Siyuan, Wang, Zhonghua, Xu, Zhongxing, Su, Shiyan, Sun, Shangquan, Zhao, Runkai, Zhang, Zhisheng, Yang, Dingkang, Wei, Jinjie, Wang, Jiaqi, Xu, Jiahao, Yan, Jiangtao, Tang, Wenhao, Zhu, Hongze, Liu, Yu, Wang, Fudi, Shen, Yiqing, Ji, Yuanfeng, Su, Yanzhou, Xie, Tong, Shan, Hongming, Feng, Chun-Mei, Hou, Zhi, Song, Diping, Liu, Lihao, Huang, Yanyan, Yu, Lequan, Fu, Bin, Wang, Shujun, Li, Xiaomeng, Hu, Xiaowei, Gu, Yun, Fei, Ben, Wang, Benyou, Cao, Yuewen, Shen, Minjie, Xu, Jie, Duan, Haodong, Yan, Fang, Hao, Hongxia, Li, Jielan, Du, Jiajun, Wang, Yanbo, Razzak, Imran, Deng, Zhongying, Zhang, Chi, Wu, Lijun, He, Conghui, Lu, Zhaohui, Huang, Jinhai, Shao, Wenqi, Liu, Yihao, Luo, Siqi, Xin, Yi, Liu, Xiaohong, Ling, Fenghua, Li, Yuqiang, Wang, Aoran, Sun, Siqi, Zheng, Qihao, Dong, Nanqing, Fu, Tianfan, Zhou, Dongzhan, Lu, Yan, Zhang, Wenlong, Ye, Jin, Cai, Jianfei, Chen, Yirong, Ouyang, Wanli, Qiao, Yu, Ge, Zongyuan, Tang, Shixiang, He, Junjun, Song, Chunfeng, Bai, Lei, Zhou, Bowen
Format: Preprint
Published: 2025
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2508.21148
author Hu, Ming
Ma, Chenglong
Li, Wei
Xu, Wanghan
Wu, Jiamin
Hu, Jucheng
Li, Tianbin
Zhuang, Guohang
Liu, Jiaqi
Lu, Yingzhou
Chen, Ying
Zhang, Chaoyang
Tan, Cheng
Ying, Jie
Wu, Guocheng
Gao, Shujian
Chen, Pengcheng
Lin, Jiashi
Wu, Haitao
Chen, Lulu
Wang, Fengxiang
Zhang, Yuanyuan
Zhao, Xiangyu
Tang, Feilong
Su, Encheng
Ning, Junzhi
Liu, Xinyao
Du, Ye
Ji, Changkai
Jiang, Pengfei
Tang, Cheng
Huang, Ziyan
Liu, Jiyao
Wei, Jiaqi
Yang, Yuejin
Zhang, Xiang
Wang, Guangshuai
Yang, Yue
Xu, Huihui
Chen, Ziyang
Wang, Yizhou
Tang, Chen
Wu, Jianyu
Ren, Yuchen
Yan, Siyuan
Wang, Zhonghua
Xu, Zhongxing
Su, Shiyan
Sun, Shangquan
Zhao, Runkai
Zhang, Zhisheng
Yang, Dingkang
Wei, Jinjie
Wang, Jiaqi
Xu, Jiahao
Yan, Jiangtao
Tang, Wenhao
Zhu, Hongze
Liu, Yu
Wang, Fudi
Shen, Yiqing
Ji, Yuanfeng
Su, Yanzhou
Xie, Tong
Shan, Hongming
Feng, Chun-Mei
Hou, Zhi
Song, Diping
Liu, Lihao
Huang, Yanyan
Yu, Lequan
Fu, Bin
Wang, Shujun
Li, Xiaomeng
Hu, Xiaowei
Gu, Yun
Fei, Ben
Wang, Benyou
Cao, Yuewen
Shen, Minjie
Xu, Jie
Duan, Haodong
Yan, Fang
Hao, Hongxia
Li, Jielan
Du, Jiajun
Wang, Yanbo
Razzak, Imran
Deng, Zhongying
Zhang, Chi
Wu, Lijun
He, Conghui
Lu, Zhaohui
Huang, Jinhai
Shao, Wenqi
Liu, Yihao
Luo, Siqi
Xin, Yi
Liu, Xiaohong
Ling, Fenghua
Li, Yuqiang
Wang, Aoran
Sun, Siqi
Zheng, Qihao
Dong, Nanqing
Fu, Tianfan
Zhou, Dongzhan
Lu, Yan
Zhang, Wenlong
Ye, Jin
Cai, Jianfei
Chen, Yirong
Ouyang, Wanli
Qiao, Yu
Ge, Zongyuan
Tang, Shixiang
He, Junjun
Song, Chunfeng
Bai, Lei
Zhou, Bowen
contents Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.
format Preprint
id arxiv_https___arxiv_org_abs_2508_21148
institution arXiv
publishDate 2025
record_format arxiv
title A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2508.21148