MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Guoxin; Chen, Jie; Chen, Lei; Zhao, Jiale; Meng, Fanzhe; Zhao, Wayne Xin; Song, Ruihua; Chen, Cheng; Wen, Ji-Rong; Jia, Kai
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.13018

_version_	1866916048895213568
author	Chen, Guoxin; Chen, Jie; Chen, Lei; Zhao, Jiale; Meng, Fanzhe; Zhao, Wayne Xin; Song, Ruihua; Chen, Cheng; Wen, Ji-Rong; Jia, Kai
author_facet	Chen, Guoxin; Chen, Jie; Chen, Lei; Zhao, Jiale; Meng, Fanzhe; Zhao, Wayne Xin; Song, Ruihua; Chen, Cheng; Wen, Ji-Rong; Jia, Kai
contents	Agentic systems increasingly automate pieces of AI research. Yet turning underspecified research objectives into runnable, experimentally validated ML systems remains a central bottleneck. We study this operational setting as long-horizon ML research engineering: converting a research specification into a runnable ML system through repeated implementation, experimentation, and refinement. The central challenge is to sustain cumulative project progress across heterogeneous stages under delayed, confounded feedback. We introduce AiScientist, a multi-agent system built around thin control over thick state: a lightweight hierarchical research team coordinates through a File-as-Bus workspace that preserves decision-relevant artifacts across roles and invocations. On PaperBench, AiScientist improves over the strongest matched baselines by 9.92 and 11.15 points with Gemini-3-Flash and GLM-5, respectively. On MLE-Bench Lite, it reaches 81.82 Any Medal% under both backbones, improving over the strongest matched baselines by 4.55 and 16.67 points, and exceeding a Codex/GPT-5.5 xhigh frontier harness reference by 13.64 Any Medal points. Ablations and process analyses show that durable project state is central to later-round refinement: removing File-as-Bus lowers PaperBench score by 6.41 points and MLE-Bench Lite Any Medal% by 31.82 points. These results suggest that long-horizon AI research is not only a problem of stronger local reasoning, but a systems problem of maintaining cumulative, inspectable project progress.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_13018
institution	arXiv
publishDate	2026
record_format	arxiv
title	Toward Autonomous Long-Horizon Engineering for ML Research
topic	Computation and Language
url	https://arxiv.org/abs/2604.13018
