Bibliographic Details
Main Authors: Li, Guangrui, Xie, Yaochen, Liu, Yi, Dong, Ziwei, Pan, Xingyuan, Zheng, Tianqi, Choi, Jason, Morais, Michael J., Jha, Binit, Mishra, Shaunak, Zhou, Bingrou, Luo, Chen, Cheng, Monica Xiao, Song, Dawn
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.05910
_version_ 1866913145912557568
author Li, Guangrui
Xie, Yaochen
Liu, Yi
Dong, Ziwei
Pan, Xingyuan
Zheng, Tianqi
Choi, Jason
Morais, Michael J.
Jha, Binit
Mishra, Shaunak
Zhou, Bingrou
Luo, Chen
Cheng, Monica Xiao
Song, Dawn
contents LLM-powered tool-calling agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet most existing benchmarks evaluate these systems under static environment interfaces, with fixed schemas and toolsets, making it difficult to assess how agents behave as environments evolve, that is, when capabilities are added, reorganized, or deprecated across successive environment versions. In this paper, we study structured environment evolution as a benchmark-construction problem for tool-calling agents. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: its data, tools, and schema. Under this formalism, capability additions, removals, and modifications are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve supports (1) automatic generation of evolved executable environments through explicit graph transformations, and (2) graph-grounded construction of task sandboxes via subgraph sampling and instantiation. We validate ProEvolve in two tool-calling domains, e-commerce and airline booking, in terms of quality, implementation validity, and failure modes. Finally, we use the generated benchmark as a downstream diagnostic to study how representative agents behave under structured environment evolution.
format Preprint
id arxiv_https___arxiv_org_abs_2603_05910
institution arXiv
publishDate 2026
record_format arxiv
title The World Won't Stay Still: Programmable Evolution for Agent Benchmarks
topic Artificial Intelligence
url https://arxiv.org/abs/2603.05910