Bibliographic Details
Main Author: Abrahamsen, Nilin
Format: Preprint
Published: 2026
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.10498
author Abrahamsen, Nilin
contents This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.
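The accumulation-based variant described above can be illustrated with a minimal NumPy sketch. It is not the author's implementation: the function names `project_orthogonal` and `accumulate`, the flattened-vector representation of gradients, and the choice to project the running gradient before adding each new microbatch gradient are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def project_orthogonal(g, S):
    """Return the component of g orthogonal to the row space of S.

    g: (d,) running gradient (hypothetical flattened representation).
    S: (k, d) matrix whose rows stand in for sequence-wise
       log-probability gradients of one microbatch (assumption).
    """
    # Least-squares coefficients c minimizing ||S.T @ c - g||;
    # S.T @ c is then the projection of g onto the row space of S.
    c, *_ = np.linalg.lstsq(S.T, g, rcond=None)
    return g - S.T @ c

def accumulate(microbatches):
    """Accumulate microbatch gradients with a projection at each step.

    microbatches: iterable of (g_mb, S_mb) pairs.
    Assumed ordering: project the running gradient, then add g_mb.
    """
    running = None
    for g_mb, S_mb in microbatches:
        g_mb = np.asarray(g_mb, dtype=float)
        S_mb = np.asarray(S_mb, dtype=float)
        if running is None:
            running = np.zeros_like(g_mb)
        running = project_orthogonal(running, S_mb) + g_mb
    return running
```

After `project_orthogonal(g, S)`, the result has zero inner product with every row of `S`, which is the sense of "projected orthogonal to" assumed here. A real training setup would operate on per-parameter tensors rather than one dense vector.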
format Preprint
id arxiv_https___arxiv_org_abs_2601_10498
institution arXiv
publishDate 2026
record_format arxiv
title PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2601.10498