| Main Author: | Masloub, David |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2025 |
| Online Access: | https://doi.org/10.5281/zenodo.17495578 |
| _version_ | 1866901602488549376 |
|---|---|
| author | Masloub, David |
| author_facet | Masloub, David |
| contents | <p>Proximal Policy Optimization (PPO) is a foundational on-policy reinforcement learning algorithm known for stable training but limited sample efficiency. Recent advances like PPO+ improved PPO’s performance through off-policy critic training, bounded action outputs, and entropy regularization, yet PPO+ still updates its actor on-policy. In this paper, we present RePPO+, a novel extension of PPO+ that employs a fully off-policy actor without sacrificing PPO’s characteristic stability. The key idea is to replace PPO+’s on-policy clipped surrogate loss with a new off-policy surrogate that combines Off-Policy PPO (OP-PPO) importance weighting and Simple Policy Optimization (SPO)’s smooth trust-region penalty. Naively training the PPO+ actor with replay buffer data leads to instability and divergence due to ratio blowup: excessively large importance weights that cause the loss to explode. We address this by augmenting the surrogate objective with SPO’s quadratic penalty on policy ratio deviations, which tames large updates, and by incorporating OP-PPO’s adaptive clipping of importance weights to further constrain off-policy drift. This OP-SPO surrogate enables stable actor updates from replay buffer data while preserving the trust-region constraint underpinning PPO’s reliability.</p> |
| format | Digital resource |
| id | zenodo_https___doi_org_10_5281_zenodo_17495578 |
| institution | Zenodo |
| language | |
| publishDate | 2025 |
| publisher | Zenodo |
| record_format | zenodo |
| title | RePPO+: Replay-Driven Proximal Policy Optimization with Off-Policy Correction |
| url | https://doi.org/10.5281/zenodo.17495578 |
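
The abstract describes the surrogate only in words, so the following is a minimal illustrative sketch rather than the paper's actual loss. It assumes the quadratic penalty takes the form |A| / (2ε) · (r − 1)², and it uses a fixed truncation threshold as a stand-in for OP-PPO's adaptive clipping, whose exact rule this record does not give. The function names and the parameters `eps` and `w_max` are hypothetical.

```python
import numpy as np

def naive_surrogate(ratio, adv):
    """Unconstrained importance-weighted objective: grows without bound in the ratio."""
    return ratio * adv

def penalized_surrogate(ratio, adv, eps=0.2, w_max=2.0):
    """Sketch of a penalized off-policy surrogate (assumed form, not the paper's).

    ratio : pi_new(a|s) / pi_behavior(a|s) for replay-buffer samples
    adv   : advantage estimates
    eps   : trust-region width (hypothetical default)
    w_max : fixed truncation threshold, a stand-in for adaptive clipping
    """
    weight = np.clip(ratio, 0.0, w_max)                       # truncate large importance weights
    penalty = np.abs(adv) / (2.0 * eps) * (ratio - 1.0) ** 2  # quadratic penalty on ratio deviation
    return weight * adv - penalty

# Ratios ranging from mild to extreme, as stale replay data might produce.
ratios = np.array([0.5, 1.0, 1.2, 5.0, 50.0])
adv = np.ones_like(ratios)

naive = naive_surrogate(ratios, adv)
penalized = penalized_surrogate(ratios, adv)
```

With positive advantages, the naive objective keeps rewarding larger ratios, which is the ratio blowup the abstract mentions. Under the assumed penalty, the objective peaks near r = 1 + ε and becomes strongly negative for extreme ratios, so maximizing it pulls the policy back toward the trust region.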