Staff View: RePPO+: Replay-Driven Proximal Policy Optimization with Off-Policy Correction :: Library Catalog

Bibliographic Details
Main Author:	Masloub, David
Format:	Digital resource
Language:
Published:	Zenodo 2025
Online Access:	https://doi.org/10.5281/zenodo.17495578

_version_	1866901602488549376
author	Masloub, David
author_facet	Masloub, David
contents	Proximal Policy Optimization (PPO) is a foundational on-policy reinforcement learning algorithm known for stable training but limited sample efficiency. Recent advances like PPO+ improved PPO’s performance through off-policy critic training, bounded action outputs, and entropy regularization, yet PPO+ still updates its actor on-policy. In this paper, we present RePPO+, a novel extension of PPO+ that employs a fully off-policy actor without sacrificing PPO’s characteristic stability. The key idea is to replace PPO+’s on-policy clipped surrogate loss with a new off-policy surrogate that combines Off-Policy PPO (OP-PPO) importance weighting and Simple Policy Optimization (SPO)’s smooth trust-region penalty. Naively training the PPO+ actor with replay buffer data leads to instability and divergence due to ratio blowup - excessively large importance weights that cause the loss to explode. We address this by augmenting the surrogate objective with SPO’s quadratic penalty on policy ratio deviations, which tames large updates, and by incorporating OP-PPO’s adaptive clipping of importance weights to further constrain off-policy drift. This OP-SPO surrogate enables stable actor updates from replay buffer data while preserving the trust-region constraint underpinning PPO’s reliability.
format	Digital resource
id	zenodo_https___doi_org_10_5281_zenodo_17495578
institution	Zenodo
language
publishDate	2025
publisher	Zenodo
record_format	zenodo
spellingShingle	RePPO+: Replay-Driven Proximal Policy Optimization with Off-Policy Correction Masloub, David Proximal Policy Optimization (PPO) is a foundational on-policy reinforcement learning algorithm known for stable training but limited sample efficiency. Recent advances like PPO+ improved PPO’s performance through off-policy critic training, bounded action outputs, and entropy regularization, yet PPO+ still updates its actor on-policy. In this paper, we present RePPO+, a novel extension of PPO+ that employs a fully off-policy actor without sacrificing PPO’s characteristic stability. The key idea is to replace PPO+’s on-policy clipped surrogate loss with a new off-policy surrogate that combines Off-Policy PPO (OP-PPO) importance weighting and Simple Policy Optimization (SPO)’s smooth trust-region penalty. Naively training the PPO+ actor with replay buffer data leads to instability and divergence due to ratio blowup - excessively large importance weights that cause the loss to explode. We address this by augmenting the surrogate objective with SPO’s quadratic penalty on policy ratio deviations, which tames large updates, and by incorporating OP-PPO’s adaptive clipping of importance weights to further constrain off-policy drift. This OP-SPO surrogate enables stable actor updates from replay buffer data while preserving the trust-region constraint underpinning PPO’s reliability.
title	RePPO+: Replay-Driven Proximal Policy Optimization with Off-Policy Correction
url	https://doi.org/10.5281/zenodo.17495578
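
Illustrative note on the abstract: the record describes a surrogate that pairs clipped importance weights with a quadratic penalty on policy ratio deviations, but it does not give the formula. The sketch below is one plausible reading under stated assumptions, not the paper's objective. The function name `op_spo_loss`, the fixed cap `clip_max` (standing in for the adaptive clipping the abstract mentions), and the penalty scale `|A| / (2 * epsilon)` are all hypothetical choices made for illustration.

```python
import math


def op_spo_loss(logp_new, logp_behavior, advantages, clip_max=2.0, epsilon=0.2):
    """Mean surrogate loss (to be minimized) over a replay-buffer batch.

    Assumed form, per sample:
        -min(ratio, clip_max) * A  +  |A| / (2 * epsilon) * (ratio - 1)^2
    where ratio = pi_new(a|s) / pi_behavior(a|s).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        # Importance ratio between the current policy and the policy
        # that generated the replayed sample.
        ratio = math.exp(lp_new - lp_old)
        # Assumption: a fixed upper cap replaces the adaptive clipping.
        weight = min(ratio, clip_max)
        # Assumption: quadratic penalty that grows as the ratio leaves 1.
        penalty = abs(adv) / (2.0 * epsilon) * (ratio - 1.0) ** 2
        total += -(weight * adv) + penalty
    return total / len(advantages)
```

When the current and behavior policies agree (ratio equal to 1), the penalty vanishes and the loss reduces to the negative mean advantage. When the ratio is large, the cap bounds the weighted advantage term while the penalty grows quadratically, which is the qualitative behavior the abstract attributes to the combined surrogate.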

Similar Items