Enregistré dans:
Détails bibliographiques
Auteurs principaux: Liu, Yi, Hui, TingFeng, Zhang, Wei, Sun, Li, Su, Ningxin, Wang, Jian, Su, Sen
Format: Preprint
Publié: 2026
Sujets:
Accès en ligne:https://arxiv.org/abs/2605.07247
Tags: Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
_version_ 1866910200508710912
author Liu, Yi
Hui, TingFeng
Zhang, Wei
Sun, Li
Su, Ningxin
Wang, Jian
Su, Sen
author_facet Liu, Yi
Hui, TingFeng
Zhang, Wei
Sun, Li
Su, Ningxin
Wang, Jian
Su, Sen
contents Scalable AI agents training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. 2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine-grained difficulty stratification along three axes. 3) Systematic evaluations reveal that all state-of-the-art language models suffer from a universal state change cliff: they achieve near-perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. 4) We design a constraint-driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM-based environment simulation, establishing a foundation for scalable agent training. Code and data are available at https://github.com/cookieApril/EnvSimBench
format Preprint
id arxiv_https___arxiv_org_abs_2605_07247
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
Liu, Yi
Hui, TingFeng
Zhang, Wei
Sun, Li
Su, Ningxin
Wang, Jian
Su, Sen
Artificial Intelligence
Scalable AI agents training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. 2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine-grained difficulty stratification along three axes. 3) Systematic evaluations reveal that all state-of-the-art language models suffer from a universal state change cliff: they achieve near-perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. 4) We design a constraint-driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM-based environment simulation, establishing a foundation for scalable agent training. Code and data are available at https://github.com/cookieApril/EnvSimBench
title EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
topic Artificial Intelligence
url https://arxiv.org/abs/2605.07247