Bibliographic Details
Main Authors: Bartoldson, Brian; Venkatraman, Siddarth; Diffenderfer, James; Jain, Moksh; Ben-Nun, Tal; Lee, Seanie; Kim, Minsu; Obando-Ceron, Johan; Bengio, Yoshua; Kailkhura, Bhavya
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2503.18929
author Bartoldson, Brian
Venkatraman, Siddarth
Diffenderfer, James
Jain, Moksh
Ben-Nun, Tal
Lee, Seanie
Kim, Minsu
Obando-Ceron, Johan
Bengio, Yoshua
Kailkhura, Bhavya
contents Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, the on-policy algorithms used for post-training are not naturally robust to the diverse contents of experience replay buffers, which asynchronous off-policy actors can populate efficiently in parallel with training. We propose learning efficiently from such off-policy data via Trajectory Balance with Asynchrony (TBA), an approach to asynchronous RL for LLMs that leverages the principled off-policy trajectory balance (TB) objective. On math, preference-tuning, and automated red-teaming tasks, we post-train models ranging from Pythia 410M to Qwen 2.5 7B and find that TBA offers speed and performance gains over strong baselines such as Online DPO and Dr. GRPO. Beyond TBA's performance benefits (high accuracy even as asynchrony grows) and speedups ($4\times$ or more), we show that its reward- and recency-prioritizing sampling enables further gains as data generation is scaled. Our code is available at https://github.com/bbartoldson/TBA.
format Preprint
id arxiv_https___arxiv_org_abs_2503_18929
institution arXiv
publishDate 2025
record_format arxiv
title Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
topic Machine Learning
url https://arxiv.org/abs/2503.18929