Bibliographic Details
Main Authors: Maass, Wolfgang; Janzen, Sabine
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.26657
_version_ 1866910258412126208
author Maass, Wolfgang
Janzen, Sabine
author_facet Maass, Wolfgang
Janzen, Sabine
contents Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: \emph{completion} (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and \emph{optimality} (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate, because the penalty's equilibrium drives the dominant-activity share to zero. Action-space restriction combined with horizon access achieves completion but leaves an optimality gap ($\Delta M_{\text{final}} = 0.271$) that we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons; the exception, at $H = 15$, is consistent with the $H^*$ boundary ($H^* \in [6, 14]$ under the NBA parameters).
format Preprint
id arxiv_https___arxiv_org_abs_2605_26657
institution arXiv
publishDate 2026
record_format arxiv
title Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
topic Artificial Intelligence
url https://arxiv.org/abs/2605.26657