Bibliographic Details
Main Authors: Zhang, Zheyu; Yang, Shuo; Kasneci, Gjergji
Format: Preprint
Published: 2026
Subjects: Computation and Language; Machine Learning
Online Access: https://arxiv.org/abs/2605.31494
author Zhang, Zheyu
Yang, Shuo
Kasneci, Gjergji
contents Post-training of language models is commonly framed as a sample-score-update loop implemented by gradient descent. A recent line of work, exemplified by RandOpt, relocates this loop to weight space, sampling Gaussian perturbations around a pretrained model and ensembling the top-K rewarded specialists at inference. While competitive with PPO and GRPO under matched training compute, this prediction-level ensemble incurs K forward passes per test example and does not extend cleanly to free-form generation. We ask whether the rewarded population can instead be folded into a single deployable model, replacing the inference-time ensemble with one consolidated update. A split-half analysis over 25 model-task pairs reveals reproducible low-rank structure in every case. We turn this geometry into CoRP (Consolidating Rewarded Perturbations), a gradient-free operator that combines reward-weighted aggregation, compatibility-aware reweighting, and a held-out validation gate, with no gradient flowing through the language model. Across five language models from 0.5B to 8B and five tasks covering math, code, and creative writing, CoRP improves the base model by 8.1 points on average. Using one tenth of RandOpt's perturbation budget, CoRP exceeds single-inference RandOpt by 6.5 points and recovers more than half of the gain of the 50-pass majority-vote ensemble, at one forward pass per test example.
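The abstract above describes sampling Gaussian perturbations around a base model, aggregating them by reward, and accepting the result only if it passes a held-out validation gate. The following is a minimal toy sketch of that general idea on a plain NumPy parameter vector, not the authors' CoRP operator. The function name `consolidate`, the softmax weighting, the `temperature` and `sigma` parameters, and the strict-improvement gate are all assumptions made for illustration. The compatibility-aware reweighting step is omitted because the abstract does not define it.

```python
import numpy as np


def consolidate(theta, reward_fn, val_fn, n_samples=64, sigma=0.1,
                temperature=1.0, seed=0):
    """Toy gradient-free update: reward-weighted average of Gaussian
    perturbations, kept only if it improves a held-out validation score.

    Hypothetical illustration; every name and default here is an assumption.
    """
    rng = np.random.default_rng(seed)

    # Sample Gaussian perturbations around the base parameters.
    eps = rng.normal(0.0, sigma, size=(n_samples, theta.size))

    # Score each perturbed model; no gradients are computed anywhere.
    rewards = np.array([reward_fn(theta + e) for e in eps])

    # Assumed weighting scheme: softmax over rewards (shifted for stability).
    weights = np.exp((rewards - rewards.max()) / temperature)
    weights /= weights.sum()

    # Fold the rewarded population into one consolidated update.
    candidate = theta + weights @ eps

    # Held-out validation gate: fall back to the base model if no gain.
    accepted = bool(val_fn(candidate) > val_fn(theta))
    return (candidate if accepted else theta), accepted
```

Because the gate returns the base parameters whenever the candidate does not score strictly higher on the validation function, the returned model never scores below the base model on that function, and inference still costs one forward pass.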
format Preprint
id arxiv_https___arxiv_org_abs_2605_31494
institution arXiv
publishDate 2026
record_format arxiv
title Consolidating Rewarded Perturbations for LLM Post-Training
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2605.31494