Bibliographic Details
Main Author: Masloub, David
Format: Digital resource
Language:
Published: Zenodo 2025
Online Access: https://doi.org/10.5281/zenodo.17495578
_version_ 1866901602488549376
author Masloub, David
author_facet Masloub, David
contents <p>Proximal Policy Optimization (PPO) is a foundational on-policy reinforcement learning algorithm known for stable training but limited sample efficiency. Recent advances like PPO+ improved PPO’s performance through off-policy critic training, bounded action outputs, and entropy regularization, yet PPO+ still updates its actor on-policy. In this paper, we present RePPO+, a novel extension of PPO+ that employs a fully off-policy actor without sacrificing PPO’s characteristic stability. The key idea is to replace PPO+’s on-policy clipped surrogate loss with a new off-policy surrogate that combines Off-Policy PPO (OP-PPO) importance weighting and Simple Policy Optimization (SPO)’s smooth trust-region penalty. Naively training the PPO+ actor with replay buffer data leads to instability and divergence due to ratio blowup: excessively large importance weights that cause the loss to explode. We address this by augmenting the surrogate objective with SPO’s quadratic penalty on policy ratio deviations, which tames large updates, and by incorporating OP-PPO’s adaptive clipping of importance weights to further constrain off-policy drift. This OP-SPO surrogate enables stable actor updates from replay buffer data while preserving the trust-region constraint underpinning PPO’s reliability.</p>
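The abstract names the two ingredients of the surrogate but does not give its formula. As a rough illustration only, the sketch below combines a capped importance weight with an SPO-style quadratic penalty on the policy ratio. Everything specific here is an assumption and not taken from the paper: the function name `op_spo_surrogate`, the parameters `epsilon` and `weight_cap`, the use of a fixed cap in place of OP-PPO's adaptive clipping, and the exact way the two terms are combined.

```python
import math


def op_spo_surrogate(logp_new, logp_behavior, advantages,
                     epsilon=0.2, weight_cap=2.0):
    """Hypothetical penalized off-policy surrogate, averaged over a batch.

    logp_new      -- log-probabilities of the replayed actions under the
                     current policy
    logp_behavior -- log-probabilities under the policy that collected them
    advantages    -- advantage estimates for the same transitions
    epsilon       -- assumed trust-region width scaling the penalty
    weight_cap    -- assumed fixed upper bound on the importance weight
                     (a stand-in for adaptive clipping, not the paper's rule)
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        # Importance ratio between the current and the behavior policy.
        ratio = math.exp(lp_new - lp_old)
        # Cap the weight so stale replay samples cannot dominate the batch.
        capped = min(ratio, weight_cap)
        # Quadratic penalty that grows as the ratio moves away from 1.
        penalty = abs(adv) / (2.0 * epsilon) * (ratio - 1.0) ** 2
        total += capped * adv - penalty
    # The actor would maximize this value (or minimize its negative).
    return total / len(advantages)


# On-policy samples (ratio == 1) incur no penalty.
on_policy = op_spo_surrogate([-1.0, -2.0], [-1.0, -2.0], [1.0, -0.5])

# A badly off-policy sample is dominated by the penalty term.
stale = op_spo_surrogate([0.0], [-5.0], [1.0])
```

In this sketch the penalty is computed from the uncapped ratio, so a sample whose ratio has drifted far from 1 lowers the objective even when its advantage is positive. That is the kind of damping of ratio blowup the abstract describes.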
format Digital resource
id zenodo_https___doi_org_10_5281_zenodo_17495578
institution Zenodo
language
publishDate 2025
publisher Zenodo
record_format zenodo
title RePPO+: Replay-Driven Proximal Policy Optimization with Off-Policy Correction
url https://doi.org/10.5281/zenodo.17495578