Staff View: The World Won't Stay Still: Programmable Evolution for Agent Benchmarks :: Library Catalog

Bibliographic Details
Main Authors:	Li, Guangrui; Xie, Yaochen; Liu, Yi; Dong, Ziwei; Pan, Xingyuan; Zheng, Tianqi; Choi, Jason; Morais, Michael J.; Jha, Binit; Mishra, Shaunak; Zhou, Bingrou; Luo, Chen; Cheng, Monica Xiao; Song, Dawn
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.05910

_version_	1866913145912557568
author	Li, Guangrui; Xie, Yaochen; Liu, Yi; Dong, Ziwei; Pan, Xingyuan; Zheng, Tianqi; Choi, Jason; Morais, Michael J.; Jha, Binit; Mishra, Shaunak; Zhou, Bingrou; Luo, Chen; Cheng, Monica Xiao; Song, Dawn
author_facet	Li, Guangrui; Xie, Yaochen; Liu, Yi; Dong, Ziwei; Pan, Xingyuan; Zheng, Tianqi; Choi, Jason; Morais, Michael J.; Jha, Binit; Mishra, Shaunak; Zhou, Bingrou; Luo, Chen; Cheng, Monica Xiao; Song, Dawn
contents	LLM-powered tool-calling agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet most existing benchmarks evaluate these systems under static environment interfaces, with fixed schemas and toolsets, making it difficult to assess how agents behave as environments evolve, that is, when capabilities are added, reorganized, or deprecated across successive environment versions. In this paper, we study structured environment evolution as a benchmark-construction problem for tool-calling agents. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying a capability is expressed as a graph transformation that coherently propagates updates across tools, schemas, and data access. Building on this, ProEvolve supports (1) automatic generation of evolved executable environments through explicit graph transformations, and (2) graph-grounded construction of task sandboxes via subgraph sampling and instantiation. We validate ProEvolve in two tool-calling domains, e-commerce and airline booking, in terms of quality, implementation validity, and failure modes. Finally, we use the generated benchmark as a downstream diagnostic to study how representative agents behave under structured environment evolution.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_05910
institution	arXiv
publishDate	2026
record_format	arxiv
title	The World Won't Stay Still: Programmable Evolution for Agent Benchmarks
topic	Artificial Intelligence
url	https://arxiv.org/abs/2603.05910
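
Illustrative note on the abstract: the record's contents field describes capabilities being changed through transformations on a typed relational graph, with updates propagating to tools and schemas. The Python sketch below is only a toy illustration of that general idea and is not ProEvolve's implementation. Every name in it (TypedGraph, tool_interface, deprecate_field, the "order" entity, the "get_order" tool, and the edge types "has_field" and "reads") is a hypothetical assumption introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TypedGraph:
    """Toy typed relational graph: typed nodes plus typed, directed edges."""
    nodes: dict = field(default_factory=dict)  # node name -> node type
    edges: set = field(default_factory=set)    # (source, edge type, target)

    def add_node(self, name, node_type):
        self.nodes[name] = node_type

    def add_edge(self, source, edge_type, target):
        # Reject edges whose endpoints are not already in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.add((source, edge_type, target))

    def remove_node(self, name):
        # Dropping a node also drops every edge that touches it, so one
        # change reaches both the schema view and the tool view.
        del self.nodes[name]
        self.edges = {e for e in self.edges if name not in (e[0], e[2])}

    def targets(self, source, edge_type):
        return sorted(t for s, et, t in self.edges
                      if s == source and et == edge_type)


def tool_interface(graph):
    """Derive the fields each tool exposes from the graph itself."""
    return {name: graph.targets(name, "reads")
            for name, node_type in sorted(graph.nodes.items())
            if node_type == "tool"}


def build_v1():
    """Hypothetical version 1 of a small e-commerce environment."""
    g = TypedGraph()
    g.add_node("order", "entity")
    g.add_node("get_order", "tool")
    for f in ("order_id", "status", "gift_note"):
        g.add_node(f, "field")
        g.add_edge("order", "has_field", f)
        g.add_edge("get_order", "reads", f)
    return g


def deprecate_field(graph, field_name):
    """Graph transformation (assumed for illustration): remove one field."""
    graph.remove_node(field_name)
    return graph


v1 = build_v1()
v2 = deprecate_field(build_v1(), "gift_note")
print(tool_interface(v1))
print(tool_interface(v2))
```

Because the tool interface is derived from the graph rather than stored separately, removing the "gift_note" node changes the entity schema and the "get_order" tool together, which is the kind of coherent propagation the abstract refers to.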

Similar Items