Staff View: Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems :: Library Catalog

Bibliographic Details
Main Authors:	Maass, Wolfgang; Janzen, Sabine
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.26657

_version_	1866910258412126208
author	Maass, Wolfgang; Janzen, Sabine
author_facet	Maass, Wolfgang; Janzen, Sabine
contents	Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: \emph{completion} (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and \emph{optimality} (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate: the penalty's equilibrium drives the dominant-activity share to zero, while action-space restriction combined with horizon access achieves completion but leaves an optimality gap ($\Delta M_{\text{final}} = 0.271$) that we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons, with the exception at $H = 15$ consistent with the $H^*$ boundary ($H^* \in [6, 14]$ under the NBA parameters).
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_26657
institution	arXiv
publishDate	2026
record_format	arxiv
title	Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.26657

Similar Items