Bibliographic Details
Main Authors: Chen, Guoxin; Chen, Jie; Chen, Lei; Zhao, Jiale; Meng, Fanzhe; Zhao, Wayne Xin; Song, Ruihua; Chen, Cheng; Wen, Ji-Rong; Jia, Kai
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2604.13018
_version_ 1866916048895213568
author Chen, Guoxin
Chen, Jie
Chen, Lei
Zhao, Jiale
Meng, Fanzhe
Zhao, Wayne Xin
Song, Ruihua
Chen, Cheng
Wen, Ji-Rong
Jia, Kai
author_facet Chen, Guoxin
Chen, Jie
Chen, Lei
Zhao, Jiale
Meng, Fanzhe
Zhao, Wayne Xin
Song, Ruihua
Chen, Cheng
Wen, Ji-Rong
Jia, Kai
contents Agentic systems increasingly automate pieces of AI research. Yet turning underspecified research objectives into runnable, experimentally validated ML systems remains a central bottleneck. We study this operational setting as long-horizon ML research engineering: converting a research specification into a runnable ML system through repeated implementation, experimentation, and refinement. The central challenge is to sustain cumulative project progress across heterogeneous stages under delayed, confounded feedback. We introduce AiScientist, a multi-agent system built around thin control over thick state: a lightweight hierarchical research team coordinates through a File-as-Bus workspace that preserves decision-relevant artifacts across roles and invocations. On PaperBench, AiScientist improves over the strongest matched baselines by 9.92 and 11.15 points with Gemini-3-Flash and GLM-5, respectively. On MLE-Bench Lite, it reaches 81.82 Any Medal% under both backbones, improving over the strongest matched baselines by 4.55 and 16.67 points, and exceeding a Codex/GPT-5.5 xhigh frontier harness reference by 13.64 Any Medal points. Ablations and process analyses show that durable project state is central to later-round refinement: removing File-as-Bus lowers PaperBench score by 6.41 points and MLE-Bench Lite Any Medal% by 31.82 points. These results suggest that long-horizon AI research is not only a problem of stronger local reasoning, but a systems problem of maintaining cumulative, inspectable project progress.
format Preprint
id arxiv_https___arxiv_org_abs_2604_13018
institution arXiv
publishDate 2026
record_format arxiv
title Toward Autonomous Long-Horizon Engineering for ML Research
topic Computation and Language
url https://arxiv.org/abs/2604.13018