Saved in:
Bibliographic Details
Main Authors: Gan, Yaozhong, Yan, Renye, Tan, Xiaoyang, Wu, Zhe, Xing, Junliang
Format: Preprint
Published: 2024
Subjects:
Online Access:https://arxiv.org/abs/2406.03894
Tags: Add Tag
No Tags, Be the first to tag this record!
Table of Contents:
  • Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm, esteemed for its simplicity and efficacy. However, due to its inherent on-policy nature, its proficiency in harnessing data from disparate policies is constrained. This paper introduces a novel off-policy extension to the original PPO method, christened Transductive Off-policy PPO (ToPPO). Herein, we provide theoretical justification for incorporating off-policy data in PPO training and prudent guidelines for its safe application. Our contribution includes a novel formulation of the policy improvement lower bound for prospective policies derived from off-policy data, accompanied by a computationally efficient mechanism to optimize this bound, underpinned by assurances of monotonic improvement. Comprehensive experimental results across six representative tasks underscore ToPPO's promising performance.