Bibliographic Details
Main Authors: Amico, Jeffrey; Andrade, Gabriel Passamani; Donaghy, John; Fielding, Ben; Forbus, Tristin; Grieve, Harry; Kara, Semih; Kolehmainen, Jari; Lou, Yihua; Nies, Christopher; Nuño, Edward Phillip Flores; Ortega, Diogo; Rastogi, Shikhar; Virts, Austin; Wright, Matthew J.
Format: Preprint
Published: 2025
Subjects: Machine Learning; Multiagent Systems
Online Access: https://arxiv.org/abs/2509.08721
_version_ 1866908530610536448
author Amico, Jeffrey
Andrade, Gabriel Passamani
Donaghy, John
Fielding, Ben
Forbus, Tristin
Grieve, Harry
Kara, Semih
Kolehmainen, Jari
Lou, Yihua
Nies, Christopher
Nuño, Edward Phillip Flores
Ortega, Diogo
Rastogi, Shikhar
Virts, Austin
Wright, Matthew J.
contents Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g., latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in isolation if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper, we show that SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.
format Preprint
id arxiv_https___arxiv_org_abs_2509_08721
institution arXiv
publishDate 2025
record_format arxiv
title Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
topic Machine Learning
Multiagent Systems
url https://arxiv.org/abs/2509.08721
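The rollout-sharing idea described in the abstract can be illustrated with a small toy sketch. The abstract does not say how many rollouts a node samples, how shared rollouts are filtered, or how the policy is updated, so everything below is an assumption made for illustration: the names `Rollout`, `make_rollouts`, and `build_training_batch` are hypothetical, rewards are random placeholders, and no policy update is performed. The sketch only shows a node mixing its own rollouts with ones drawn from a shared pool.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    node_id: int    # which node produced this rollout
    text: str       # stand-in for a generated completion
    reward: float   # stand-in for a task reward


def make_rollouts(node_id, count, rng):
    """Hypothetical stand-in for sampling a node's own policy and scoring it."""
    return [Rollout(node_id, f"node{node_id}-sample{i}", rng.random())
            for i in range(count)]


def build_training_batch(node_id, local, shared_pool, n_local, n_external, rng):
    """Mix a node's own rollouts with rollouts sampled from other nodes.

    Assumption: external rollouts are drawn uniformly at random from the pool.
    If the pool is empty, the node simply trains on its own rollouts.
    """
    others = [r for r in shared_pool if r.node_id != node_id]
    own = rng.sample(local, min(n_local, len(local)))
    external = rng.sample(others, min(n_external, len(others)))
    return own + external


rng = random.Random(0)
# Three toy nodes, each producing four rollouts and sharing all of them.
local_by_node = {n: make_rollouts(n, 4, rng) for n in range(3)}
shared_pool = [r for rollouts in local_by_node.values() for r in rollouts]

# Node 0 builds a batch from two of its own rollouts and two from the swarm.
batch = build_training_batch(0, local_by_node[0], shared_pool, 2, 2, rng)

# A node with no access to the pool falls back to its own rollouts only.
solo_batch = build_training_batch(0, local_by_node[0], [], 4, 4, rng)
```

In this sketch the external share is a fixed count per batch; the record gives no indication of which split, if any, the authors use.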