Saved in:
| Main Authors: | , , , , , , |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.08747 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866918500589633536 |
|---|---|
| author | Chen, Ying Fang, Lihuang Jiang, Rui Wang, Mingxu Gu, Zhifeng Yi, Lei Chen, Jie |
| author_facet | Chen, Ying Fang, Lihuang Jiang, Rui Wang, Mingxu Gu, Zhifeng Yi, Lei Chen, Jie |
| contents | Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2605_08747 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents Chen, Ying Fang, Lihuang Jiang, Rui Wang, Mingxu Gu, Zhifeng Yi, Lei Chen, Jie Artificial Intelligence Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable. |
| title | Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents |
| topic | Artificial Intelligence |
| url | https://arxiv.org/abs/2605.08747 |