Bibliographic Details
Main Authors:	Lin, Zhihang; Lin, Mingbao; Xie, Yuan; Ji, Rongrong
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2503.22342

_version_	1866915605210202112
author	Lin, Zhihang; Lin, Mingbao; Xie, Yuan; Ji, Rongrong
author_facet	Lin, Zhihang; Lin, Mingbao; Xie, Yuan; Ji, Rongrong
contents	This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need to sample multiple completions for each question. Our experiment and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and not all completions contribute equally to policy training -- their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experiments show that CPPO achieves up to 7.98× speedup on GSM8K and 3.48× on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_22342
institution	arXiv
publishDate	2025
record_format	arxiv
title	CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
topic	Artificial Intelligence
url	https://arxiv.org/abs/2503.22342
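
Illustrative note on the contents field: the abstract says CPPO prunes completions with low absolute advantages before the gradient step. The following is a minimal Python sketch of that idea only, not the authors' implementation. The function names, the `pruning_rate` parameter, the keep-count formula, and the use of population standard deviation for normalization are all assumptions made for illustration; the dynamic completion allocation strategy mentioned in the abstract is not shown.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one question's group of completions.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using the population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prune_by_absolute_advantage(rewards, pruning_rate):
    """Keep the completions whose advantages have the largest magnitude.

    `pruning_rate` is a hypothetical parameter: the fraction of the
    group to drop. Returns the kept indices (in original order) and
    their advantages.
    """
    advantages = group_relative_advantages(rewards)
    # Assumed keep count: floor(G * (1 - pruning_rate)), at least 1.
    keep = max(1, math.floor(len(rewards) * (1.0 - pruning_rate)))
    # Rank completion indices by |advantage|, largest first.
    ranked = sorted(
        range(len(rewards)),
        key=lambda i: abs(advantages[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])
    return kept, [advantages[i] for i in kept]


if __name__ == "__main__":
    # Four completions for one question with made-up rewards.
    kept, kept_adv = prune_by_absolute_advantage([2.0, 1.0, 1.0, 0.0], 0.5)
    print(kept, kept_adv)
```

In this sketch, only the kept completions would go on to the loss and gradient computation, which is where the abstract locates the savings.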

Similar Items