| Main authors: | Xu, Yang; Aggarwal, Vaneet |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning; Numerical Analysis |
| Online access: | https://arxiv.org/abs/2602.00474 |
| _version_ | 1866914540309970944 |
|---|---|
| author | Xu, Yang; Aggarwal, Vaneet |
| author_facet | Xu, Yang; Aggarwal, Vaneet |
| contents | We study fixed-policy evaluation for finite Markov chains that may be reducible and periodic. Classical evaluation methods with gain and bias decomposition are not always diagnostic: the gain records only invariant Cesàro averages, while persistent phase-dependent behavior is absorbed into the bias together with genuinely transient effects. We identify the real peripheral invariant subspace $\mathcal{K}(P)$ of the transition matrix $P$ as the source of this ambiguity. Quotienting by $\mathcal{K}(P)$ is the minimal exact quotient that removes all non-decaying modes and makes the remaining dynamics strictly stable. After choosing a gauge projection $\Pi$ with kernel $\mathcal{K}(P)$, the reward admits a unique decomposition $r = g_\Pi^\star + (I-P)v_\Pi^\star$, where $g_\Pi^\star$ is a persistent regime profile and $v_\Pi^\star$ is a gauge-fixed transient component. An exact comparison with classical normalized gain and bias shows that the new pair reallocates the same information so that all persistent modes are represented in $g_\Pi^\star$ and $v_\Pi^\star$ is transient. This decomposition reconstructs finite-horizon returns, recovers statewise average reward, admits a transient-cost interpretation, and yields a stable estimator under a generative model. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_00474 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Persistent-Transient Policy Evaluation for Markov Chains via Minimal Peripheral Quotients |
| topic | Machine Learning; Numerical Analysis |
| url | https://arxiv.org/abs/2602.00474 |
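
The abstract's remark that the classical bias mixes persistent phase-dependent behavior with genuinely transient effects can be illustrated on a small example. The sketch below is not taken from the paper: the 3-state chain, the reward vector, and the helper `spectral_projector` are hypothetical, and it computes only the classical gain and bias. Splitting the bias along eigenspaces of $P$ is an assumption made here for illustration; it does not reproduce the paper's construction of $\mathcal{K}(P)$, $\Pi$, $g_\Pi^\star$, or $v_\Pi^\star$.

```python
import numpy as np

# Hypothetical chain: state 0 is transient, states 1 and 2 form a 2-cycle.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
r = np.array([0.0, 1.0, 0.0])  # illustrative reward vector
I = np.eye(3)


def spectral_projector(P, select):
    """Projector onto the eigenspaces whose eigenvalues satisfy `select`.

    Assumes P is diagonalizable, which holds for this example
    but not for every transition matrix.
    """
    w, V = np.linalg.eig(P)
    mask = select(w).astype(complex)
    return (V @ np.diag(mask) @ np.linalg.inv(V)).real


tol = 1e-9
# Cesaro limit of P^k: projector onto the eigenvalue-1 eigenspace.
P_star = spectral_projector(P, lambda w: np.abs(w - 1.0) < tol)
# Projector onto all eigenvalues of modulus 1 (here: +1 and -1).
Q_per = spectral_projector(P, lambda w: np.abs(np.abs(w) - 1.0) < tol)

# Classical gain and bias: r = gain + (I - P) @ bias, with P_star @ bias = 0.
gain = P_star @ r
bias = np.linalg.solve(I - P + P_star, r - gain)

# Illustrative split of the classical bias (not the paper's decomposition):
bias_persistent = Q_per @ bias            # never decays under powers of P
bias_transient = bias - bias_persistent   # decays like 0.5**k under P**k

print("gain            ", gain)
print("bias            ", bias)
print("bias, persistent", bias_persistent)
print("bias, transient ", bias_transient)
```

For this chain the gain is 0.5 in every state and the classical bias is (-0.75, 0.25, -0.25). The part (-1/12, 0.25, -0.25) lies in the eigenvalue -1 eigenspace and flips sign at every step without decaying, while only (-2/3, 0, 0) is transient. The gain alone therefore says nothing about the phase of the 2-cycle, which is the kind of ambiguity the abstract attributes to the classical pair.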