Staff View: RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization :: Library Catalog


Bibliographic Details
Main Authors:	Xia, Linxuan; Yang, Xiaolong; Chen, Yongyuan; Zhao, Enyue; Cai, Deng; Wang, Yasheng; Wu, Boxi
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.10819

_version_	1866915791583051776
author	Xia, Linxuan; Yang, Xiaolong; Chen, Yongyuan; Zhao, Enyue; Cai, Deng; Wang, Yasheng; Wu, Boxi
author_facet	Xia, Linxuan; Yang, Xiaolong; Chen, Yongyuan; Zhao, Enyue; Cai, Deng; Wang, Yasheng; Wu, Boxi
contents	Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_10819
institution	arXiv
publishDate	2026
record_format	arxiv
title	RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization
topic	Machine Learning
url	https://arxiv.org/abs/2602.10819
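
The replacement step described in the contents field (swapping low-reward rollouts for rephrased trajectories) can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than the authors' implementation: the `Rollout` type, the `replace_low_reward` helper, the reward threshold, the cap on replacements, and the rule that a rephrased trajectory is kept only if it scores higher are all hypothetical. The `rephrase` callable stands in for prompting the policy model to restate off-policy knowledge in its own words, and `score` stands in for whatever reward function is used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    text: str
    reward: float
    rephrased: bool = False  # True if this slot was filled by a rephrased trajectory


def replace_low_reward(
    rollouts: List[Rollout],
    rephrase: Callable[[], str],     # hypothetical: policy model rephrases off-policy knowledge
    score: Callable[[str], float],   # hypothetical: reward function
    threshold: float = 0.5,          # assumed cutoff for "low reward"
    max_replacements: int = 1,       # assumed cap per rollout group
) -> List[Rollout]:
    """Return a copy of the group with the worst rollouts swapped for rephrased ones."""
    # Indices of rollouts below the threshold, lowest reward first.
    low = sorted(
        (i for i, r in enumerate(rollouts) if r.reward < threshold),
        key=lambda i: rollouts[i].reward,
    )
    out = list(rollouts)
    for i in low[:max_replacements]:
        text = rephrase()
        reward = score(text)
        # Keep the rephrased trajectory only if it actually improves on the rollout.
        if reward > out[i].reward:
            out[i] = Rollout(text, reward, rephrased=True)
    return out


group = [Rollout("a", 0.0), Rollout("b", 1.0), Rollout("c", 0.2)]
result = replace_low_reward(group, rephrase=lambda: "rephrased solution", score=lambda t: 1.0)
```

In this toy group only the single worst rollout is replaced, so the remaining samples still come from the policy's own generations; how many rollouts RePO actually replaces, and under what condition, is not stated in this record.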
