| Main Author: | Abrahamsen, Nilin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2601.10498 |
| _version_ | 1866917277436215296 |
|---|---|
| author | Abrahamsen, Nilin |
| author_facet | Abrahamsen, Nilin |
| contents | This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_10498 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates |
| topic | Machine Learning Artificial Intelligence |
| url | https://arxiv.org/abs/2601.10498 |
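
The accumulation-based variant described in the abstract can be illustrated with a small NumPy sketch. This is not the author's implementation. It assumes flattened gradient vectors, one matrix per microbatch whose rows are the sequence-wise log-probability gradients, and that the running gradient is projected before the new microbatch gradient is added (the abstract does not state the order). The names `project_out` and `accumulate_projected` are hypothetical.

```python
import numpy as np


def project_out(g, V):
    """Return g minus its orthogonal projection onto the row space of V.

    g: shape (d,), a flattened gradient.
    V: shape (k, d), one row per sequence (assumed layout).
    """
    # lstsq finds c minimizing ||V.T @ c - g||, so V.T @ c is the
    # orthogonal projection of g onto span(rows of V), even if V is rank-deficient.
    coeffs, *_ = np.linalg.lstsq(V.T, g, rcond=None)
    return g - V.T @ coeffs


def accumulate_projected(microbatch_grads, seq_logprob_grads):
    """Accumulate microbatch gradients with a projection step per microbatch.

    Assumption (not confirmed by the abstract): the running gradient is
    projected orthogonal to the current microbatch's sequence-wise
    log-probability gradients *before* that microbatch's gradient is added.
    """
    running = np.zeros_like(microbatch_grads[0])
    for g, V in zip(microbatch_grads, seq_logprob_grads):
        running = project_out(running, V)
        running = running + g
    return running
```

Under these assumptions, the gradient accumulated from earlier microbatches has no component along any sequence-wise log-probability gradient of the microbatch just processed, which is one way to read the abstract's description of the projection.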