Bibliographic Details
Main Authors: Prabhakar, Vignesh; Islam, Md Amirul; Atanas, Adam; Wang, Yao-Ting; Han, Joah; Jhunjhunwala, Aastha; Apte, Rucha; Clark, Robert; Xu, Kang; Wang, Zihan; Liu, Kai
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2503.17604
_version_ 1866915253630009344
author Prabhakar, Vignesh
Islam, Md Amirul
Atanas, Adam
Wang, Yao-Ting
Han, Joah
Jhunjhunwala, Aastha
Apte, Rucha
Clark, Robert
Xu, Kang
Wang, Zihan
Liu, Kai
contents Large Language Models (LLMs) have demonstrated remarkable potential in advancing scientific knowledge and addressing complex challenges. In this work, we introduce OmniScience, a specialized large reasoning model for general science, developed through three key components: (1) domain adaptive pretraining on a carefully curated corpus of scientific literature, (2) instruction tuning on a specialized dataset to guide the model in following domain-specific tasks, and (3) reasoning-based knowledge distillation through fine-tuning to significantly enhance its ability to generate contextually relevant and logically sound responses. We demonstrate the versatility of OmniScience by developing a battery agent that efficiently ranks molecules as potential electrolyte solvents or additives. Comprehensive evaluations reveal that OmniScience is competitive with state-of-the-art large reasoning models on the GPQA Diamond and domain-specific battery benchmarks, while outperforming all public reasoning and non-reasoning models with similar parameter counts. We further demonstrate via ablation experiments that domain adaptive pretraining and reasoning-based knowledge distillation are critical to attaining our performance levels across benchmarks.
format Preprint
id arxiv_https___arxiv_org_abs_2503_17604
institution arXiv
publishDate 2025
record_format arxiv
title OmniScience: A Domain-Specialized LLM for Scientific Reasoning and Discovery
topic Artificial Intelligence
url https://arxiv.org/abs/2503.17604