Staff View: PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates :: Library Catalog

Bibliographic Details
Main Author:	Abrahamsen, Nilin
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.10498

_version_	1866917277436215296
author	Abrahamsen, Nilin
author_facet	Abrahamsen, Nilin
contents	This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_10498
institution	arXiv
publishDate	2026
record_format	arxiv
title	PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2601.10498

Similar Items