Bibliographic Details
Main Authors: Chen, Ying; Fang, Lihuang; Jiang, Rui; Wang, Mingxu; Gu, Zhifeng; Yi, Lei; Chen, Jie
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.08747
author Chen, Ying
Fang, Lihuang
Jiang, Rui
Wang, Mingxu
Gu, Zhifeng
Yi, Lei
Chen, Jie
contents Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures (never completing the task, completing it but failing to stop, and reporting success without sufficient evidence) collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
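The split between W and B described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Episode` fields, the function names, and in particular the mapping from the two booleans to the four outcome categories are assumptions inferred from the abstract's wording; the actual VIGIL protocol may define them differently (for example, by tracking whether the goal state was reached at any step rather than at episode end).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    # Hypothetical fields, not taken from the paper:
    goal_attained: bool     # hidden world state satisfies the task goal
    reported_success: bool  # agent's terminal report claims completion


def categorize(ep: Episode) -> str:
    """Assumed mapping of an episode to one of the four named outcomes."""
    if ep.goal_attained and ep.reported_success:
        return "verified_success"
    if ep.goal_attained:
        # Goal reached, but the agent never closed with a success report.
        return "post_attainment_drift"
    if ep.reported_success:
        # Success claimed without the world state supporting it.
        return "unsupported_commitment"
    return "missed_execution"


def score(episodes):
    """Return (W, B, per-category counts) for a non-empty list of episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    counts = Counter(categorize(ep) for ep in episodes)
    n = len(episodes)
    w = sum(ep.goal_attained for ep in episodes) / n  # world-state completion
    b = counts["verified_success"] / n                # benchmark success
    return w, b, counts


# Toy data: one episode per category (illustrative only, not results).
episodes = [
    Episode(goal_attained=True, reported_success=True),
    Episode(goal_attained=True, reported_success=False),
    Episode(goal_attained=False, reported_success=True),
    Episode(goal_attained=False, reported_success=False),
]
w, b, counts = score(episodes)
```

Under this mapping B can never exceed W, and the gap between them is the share of episodes that reached the goal without a correct closing report.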
format Preprint
id arxiv_https___arxiv_org_abs_2605_08747
institution arXiv
publishDate 2026
record_format arxiv
title Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
topic Artificial Intelligence
url https://arxiv.org/abs/2605.08747