Staff View: Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Ying; Fang, Lihuang; Jiang, Rui; Wang, Mingxu; Gu, Zhifeng; Yi, Lei; Chen, Jie
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.08747

_version_	1866918500589633536
author	Chen, Ying; Fang, Lihuang; Jiang, Rui; Wang, Mingxu; Gu, Zhifeng; Yi, Lei; Chen, Jie
author_facet	Chen, Ying; Fang, Lihuang; Jiang, Rui; Wang, Mingxu; Gu, Zhifeng; Yi, Lei; Chen, Jie
contents	Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_08747
institution	arXiv
publishDate	2026
record_format	arxiv
title	Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.08747
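
The abstract in the contents field describes two scores, world-state completion (W) and benchmark success (B), and four outcome categories. The following is a minimal sketch of how such a decoupled scoring could be computed. It is not the authors' implementation: the `Episode` fields, the `classify` and `scores` functions, and the two-boolean mapping onto the four categories are all assumptions made for illustration, and the actual VIGIL protocol may define the categories differently.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Category names are taken from the abstract.
    MISSED_EXECUTION = "missed execution"
    POST_ATTAINMENT_DRIFT = "post-attainment drift"
    UNSUPPORTED_COMMITMENT = "unsupported commitment"
    VERIFIED_SUCCESS = "verified success"


@dataclass(frozen=True)
class Episode:
    # Both fields are hypothetical simplifications, not VIGIL's actual schema.
    world_complete: bool  # hidden world state satisfies the goal at closure
    reported_done: bool   # agent ended the episode with a report claiming completion


def classify(ep: Episode) -> Outcome:
    """Map one episode to an outcome category (assumed mapping)."""
    if ep.world_complete:
        if ep.reported_done:
            return Outcome.VERIFIED_SUCCESS
        return Outcome.POST_ATTAINMENT_DRIFT
    if ep.reported_done:
        return Outcome.UNSUPPORTED_COMMITMENT
    return Outcome.MISSED_EXECUTION


def scores(episodes: list[Episode]) -> tuple[float, float]:
    """Return (W, B) as fractions of all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    # W counts every episode whose world state is complete.
    w = sum(ep.world_complete for ep in episodes) / n
    # B additionally requires a correct terminal report.
    b = sum(classify(ep) is Outcome.VERIFIED_SUCCESS for ep in episodes) / n
    return w, b
```

Under this assumed mapping B can never exceed W, and the gap between them is the share of episodes where the goal state was reached but the agent never committed to it.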

Similar Items