Staff View: Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Liujie; Ning, Benzhe; Yang, Rui; Yu, Xiaoyan; Li, Jiaxing; Wu, Lumeng; Liu, Jia; Li, Minghao; Chen, Weihang; Hu, Weiqi; Zhang, Lei
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.11554

_version_	1866915935918489600
author	Zhang, Liujie; Ning, Benzhe; Yang, Rui; Yu, Xiaoyan; Li, Jiaxing; Wu, Lumeng; Liu, Jia; Li, Minghao; Chen, Weihang; Hu, Weiqi; Zhang, Lei
author_facet	Zhang, Liujie; Ning, Benzhe; Yang, Rui; Yu, Xiaoyan; Li, Jiaxing; Wu, Lumeng; Liu, Jia; Li, Minghao; Chen, Weihang; Hu, Weiqi; Zhang, Lei
contents	Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness-throughput tradeoff. We present Relax (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an omni-native architecture builds multimodal support into the full stack, from data preprocessing and modality-aware parallelism to inference generation, rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20× end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76× speedup over colocate on Qwen3-4B and a 2.00× speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay) [ma2025r3] for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_11554
institution	arXiv
publishDate	2026
record_format	arxiv
title	Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
topic	Computation and Language
url	https://arxiv.org/abs/2604.11554

Similar Items