Bibliographic Details
Main Authors: Xu, Yang; Aggarwal, Vaneet
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.00474
_version_ 1866914540309970944
author Xu, Yang
Aggarwal, Vaneet
author_facet Xu, Yang
Aggarwal, Vaneet
contents We study fixed-policy evaluation for finite Markov chains that may be reducible and periodic. Classical evaluation methods with gain and bias decomposition are not always diagnostic: the gain records only invariant Cesàro averages, while persistent phase-dependent behavior is absorbed into the bias together with genuinely transient effects. We identify the real peripheral invariant subspace $\mathcal{K}(P)$ of the transition matrix $P$ as the source of this ambiguity. Quotienting by $\mathcal{K}(P)$ is the minimal exact quotient that removes all non-decaying modes and makes the remaining dynamics strictly stable. After choosing a gauge projection $\Pi$ with kernel $\mathcal{K}(P)$, the reward admits a unique decomposition $r = g_\Pi^\star + (I-P)v_\Pi^\star$, where $g_\Pi^\star$ is a persistent regime profile and $v_\Pi^\star$ is a gauge-fixed transient component. An exact comparison with classical normalized gain and bias shows that the new pair reallocates the same information so that all persistent modes are represented in $g_\Pi^\star$ and $v_\Pi^\star$ is transient. This decomposition reconstructs finite-horizon returns, recovers statewise average reward, admits a transient-cost interpretation, and yields a stable estimator under a generative model.
format Preprint
id arxiv_https___arxiv_org_abs_2602_00474
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Persistent-Transient Policy Evaluation for Markov Chains via Minimal Peripheral Quotients
Xu, Yang
Aggarwal, Vaneet
Machine Learning
Numerical Analysis
title Persistent-Transient Policy Evaluation for Markov Chains via Minimal Peripheral Quotients
topic Machine Learning
Numerical Analysis
url https://arxiv.org/abs/2602.00474
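
The decomposition $r = g + (I-P)v$ described in the abstract can be illustrated numerically on a small chain. The following is a minimal sketch, not the authors' method: it assumes $P$ is diagonalizable, it picks one particular gauge (the spectral projection, whose kernel is $\mathcal{K}(P)$ and whose range is the stable invariant subspace), and it assumes that the persistent part lies in $\mathcal{K}(P)$ while the transient part lies in the range of the projection. The function name `peripheral_decomposition`, the tolerance `tol`, and the three-state example chain are hypothetical and do not come from the record.

```python
import numpy as np

def peripheral_decomposition(P, r, tol=1e-9):
    """Split r into g + (I - P) v using the spectral gauge (an assumed choice).

    Assumes P is a diagonalizable row-stochastic matrix.
    """
    n = P.shape[0]
    eigvals, V = np.linalg.eig(P)      # columns of V are right eigenvectors
    W = np.linalg.inv(V)               # rows of W are the matching left eigenvectors
    # Peripheral eigenvalues are those of modulus 1 (up to tolerance).
    peripheral = np.abs(eigvals) > 1.0 - tol
    # Spectral projector onto K(P); for a real P the imaginary parts cancel.
    Q = (V[:, peripheral] @ W[peripheral, :]).real
    Pi = np.eye(n) - Q                 # projection with kernel K(P)
    g = Q @ r                          # persistent part, lies in K(P)
    # Transient part: v = sum_t P^t Pi r, computed as (I - P Pi)^{-1} Pi r.
    v = np.linalg.solve(np.eye(n) - P @ Pi, Pi @ r)
    return g, v, Pi

# Hypothetical example: state 0 is transient and feeds a period-2 cycle {1, 2}.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
r = np.array([2.0, 1.0, 0.0])

g, v, Pi = peripheral_decomposition(P, r)
print("persistent part g:", g)
print("transient part v:", v)
print("reconstruction error:", np.max(np.abs(g + (np.eye(3) - P) @ v - r)))
```

For this chain the peripheral eigenvalues are $1$ and $-1$, and a hand calculation gives $g = (1/3, 1, 0)$ and $v = (10/3, 0, 0)$. The classical gain would be $1/2$ in every state, so the alternation between rewards $1$ and $0$ on the cycle, which the classical split places in the bias, shows up in $g$ here, while $v$ is nonzero only at the transient state.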