Bibliographic Details
Main Authors: Jansen, Peter, Tafjord, Oyvind, Radensky, Marissa, Siangliulue, Pao, Hope, Tom, Mishra, Bhavana Dalvi, Majumder, Bodhisattwa Prasad, Weld, Daniel S., Clark, Peter
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence; Computation and Language
Online Access: https://arxiv.org/abs/2503.22708
author Jansen, Peter
Tafjord, Oyvind
Radensky, Marissa
Siangliulue, Pao
Hope, Tom
Mishra, Bhavana Dalvi
Majumder, Bodhisattwa Prasad
Weld, Daniel S.
Clark, Peter
contents Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code. In this work we introduce CodeScientist, a novel ASD system that frames ideation and experiment construction as a form of genetic search jointly over combinations of research articles and codeblocks defining common actions in a domain (like prompting a language model). We use this paradigm to conduct hundreds of automated experiments on machine-generated ideas broadly in the domain of agents and virtual environments, with the system returning 19 discoveries, 6 of which were judged as being both at least minimally sound and incrementally novel after a multi-faceted evaluation beyond that typically conducted in prior work, including external (conference-style) review, code review, and replication attempts. Moreover, the discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discoveries.
format Preprint
id arxiv_https___arxiv_org_abs_2503_22708
institution arXiv
publishDate 2025
record_format arxiv
title CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation
topic Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2503.22708