Bibliographic Details
Main Authors: Zhang, Liujie, Ning, Benzhe, Yang, Rui, Yu, Xiaoyan, Li, Jiaxing, Wu, Lumeng, Liu, Jia, Li, Minghao, Chen, Weihang, Hu, Weiqi, Zhang, Lei
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2604.11554
author Zhang, Liujie
Ning, Benzhe
Yang, Rui
Yu, Xiaoyan
Li, Jiaxing
Wu, Lumeng
Liu, Jia
Li, Minghao
Chen, Weihang
Hu, Weiqi
Zhang, Lei
contents Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness-throughput tradeoff. We present Relax (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an omni-native architecture builds multimodal support into the full stack, from data preprocessing and modality-aware parallelism to inference generation, rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20× end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76× speedup over colocate on Qwen3-4B and a 2.00× speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay) \cite{ma2025r3} for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.
format Preprint
id arxiv_https___arxiv_org_abs_2604_11554
institution arXiv
publishDate 2026
record_format arxiv
title Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
topic Computation and Language
url https://arxiv.org/abs/2604.11554
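The abstract describes a single staleness parameter that interpolates among on-policy, near-on-policy, and fully asynchronous execution. The sketch below is one way such a knob could behave, not Relax's implementation: the names `Sample`, `StalenessBoundedBuffer`, and `max_staleness` are hypothetical, and dropping over-stale rollouts (rather than throttling the rollout service) is an assumption chosen for brevity. It does not use or model the TransferQueue API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    policy_version: int  # version of the weights that generated this rollout
    payload: str         # stand-in for the rollout data


class StalenessBoundedBuffer:
    """Toy rollout buffer (hypothetical): the trainer only consumes samples
    whose policy version lags its own by at most `max_staleness`."""

    def __init__(self, max_staleness: int):
        if max_staleness < 0:
            raise ValueError("max_staleness must be non-negative")
        self.max_staleness = max_staleness
        self._queue = deque()

    def put(self, sample: Sample) -> None:
        self._queue.append(sample)

    def get_batch(self, trainer_version: int, batch_size: int) -> list:
        """Pop up to `batch_size` fresh samples; over-stale ones are discarded."""
        batch = []
        while self._queue and len(batch) < batch_size:
            sample = self._queue.popleft()
            lag = trainer_version - sample.policy_version
            if lag <= self.max_staleness:
                batch.append(sample)
            # Assumption: samples with lag > max_staleness are simply dropped.
        return batch


if __name__ == "__main__":
    # max_staleness=0 behaves like on-policy training, a small positive value
    # like near-on-policy, and a very large value like fully asynchronous.
    for bound in (0, 1, 10**9):
        buf = StalenessBoundedBuffer(max_staleness=bound)
        for version in range(3):
            buf.put(Sample(policy_version=version, payload=f"rollout-{version}"))
        kept = [s.policy_version for s in buf.get_batch(trainer_version=2, batch_size=8)]
        print(bound, kept)
```

In this toy model, a larger bound lets the trainer consume older rollouts instead of waiting for fresh ones, which is the throughput side of the tradeoff the abstract names; the cost is that updates are computed on data from older policy versions.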