Bibliographic Details
Main Authors: Gupta, Abhishek; Mahajan, Aditya
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2603.17875
_version_ 1866911557431066624
author Gupta, Abhishek
Mahajan, Aditya
author_facet Gupta, Abhishek
Mahajan, Aditya
contents Markov decision processes (MDPs) are viewed as the optimization of an objective function over certain linear operators on general function spaces. A new result on the existence of optimal policies in general MDPs is established, which differs from the existence results derived previously in the literature. Using the well-established perturbation theory of linear operators, a policy difference lemma is established for general MDPs, and the Gâteaux derivative of the objective function with respect to the policy operator is derived. By upper bounding the policy difference via the theory of integral probability metrics, a new majorization-minimization type policy gradient algorithm for general MDPs is derived. This generalizes many well-known reinforcement learning algorithms to settings with general state and action spaces. Further, by taking the integral probability metric to be the maximum mean discrepancy, a low-complexity policy gradient algorithm is derived for finite MDPs. The new algorithm, called MM-RKHS, appears to be superior to the PPO algorithm owing to its low computational complexity, low sample complexity, and faster convergence.
format Preprint
id arxiv_https___arxiv_org_abs_2603_17875
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs
Gupta, Abhishek
Mahajan, Aditya
Machine Learning
Optimization and Control
90C40, 47A55, 90C26, 93E35
title Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs
topic Machine Learning
Optimization and Control
90C40, 47A55, 90C26, 93E35
url https://arxiv.org/abs/2603.17875