Staff View: Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs :: Library Catalog

Bibliographic Details
Main Authors:	Gupta, Abhishek; Mahajan, Aditya
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Optimization and Control; 90C40, 47A55, 90C26, 93E35
Online Access:	https://arxiv.org/abs/2603.17875

_version_	1866911557431066624
author	Gupta, Abhishek Mahajan, Aditya
author_facet	Gupta, Abhishek Mahajan, Aditya
contents	Markov decision processes (MDPs) are viewed as the optimization of an objective function over certain linear operators on general function spaces. A new existence result is established for optimal policies in general MDPs, which differs from the existence results derived previously in the literature. Using the well-established perturbation theory of linear operators, a policy difference lemma is established for general MDPs, and the Gâteaux derivative of the objective function as a function of the policy operator is derived. By upper bounding the policy difference via the theory of integral probability metrics, a new majorization-minimization type policy gradient algorithm for general MDPs is derived. This leads to generalizations of many well-known algorithms in reinforcement learning to cases with general state and action spaces. Further, by taking the integral probability metric to be the maximum mean discrepancy, a low-complexity policy gradient algorithm is derived for finite MDPs. The new algorithm, called MM-RKHS, appears to be superior to the PPO algorithm due to low computational complexity, low sample complexity, and faster convergence.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_17875
institution	arXiv
publishDate	2026
record_format	arxiv
title	Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs
topic	Machine Learning Optimization and Control 90C40, 47A55, 90C26, 93E35
url	https://arxiv.org/abs/2603.17875

Similar Items