Bibliographic Details
Main Authors: Yuan, Yufeng; Yue, Yu; Zhu, Ruofei; Fan, Tiantian; Yan, Lin
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2503.01491
_version_ 1866913716229898240
author Yuan, Yufeng
Yue, Yu
Zhu, Ruofei
Fan, Tiantian
Yan, Lin
author_facet Yuan, Yufeng
Yue, Yu
Zhu, Ruofei
Fan, Tiantian
Yan, Lin
contents Reinforcement learning (RL) is pivotal for enabling large language models (LLMs) to generate long chains of thought (CoT) for complex tasks like math and reasoning. However, Proximal Policy Optimization (PPO), effective in many RL scenarios, fails in long CoT tasks. This paper identifies that value initialization bias and reward signal decay are the root causes of PPO's failure. We propose Value-Calibrated PPO (VC-PPO) to address these issues. In VC-PPO, the value model is pretrained to tackle initialization bias, and the Generalized Advantage Estimation (GAE) computation is decoupled between the actor and critic to mitigate reward signal decay. Experiments on the American Invitational Mathematics Examination (AIME) show that VC-PPO significantly boosts PPO performance. Ablation studies show that techniques in VC-PPO are essential in enhancing PPO for long CoT tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2503_01491
institution arXiv
publishDate 2025
record_format arxiv
title What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret
topic Machine Learning
url https://arxiv.org/abs/2503.01491
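
Note on the abstract: it names reward signal decay and a GAE computation decoupled between actor and critic, but gives no formulas or hyperparameters. The following is a minimal sketch of the general idea using the standard GAE recursion, not the authors' implementation. The function name `gae`, the episode length, the all-zero value estimates, and the two lambda values (0.95 for the actor, 1.0 for the critic) are illustrative assumptions that do not appear in this record.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Standard Generalized Advantage Estimation for one finished episode.

    rewards[t] and values[t] are per-token; the value after the last
    token is taken to be 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at position t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Hypothetical long response: 1000 tokens, a single reward of 1.0 at the end,
# and a critic that (for illustration) predicts 0 everywhere.
T = 1000
rewards = [0.0] * (T - 1) + [1.0]
values = [0.0] * T

# Assumed decoupling: the actor and the critic use different lambda values.
actor_adv = gae(rewards, values, lam=0.95)    # policy-gradient advantages
critic_adv = gae(rewards, values, lam=1.0)    # used only to build value targets
critic_targets = [a + v for a, v in zip(critic_adv, values)]

print("actor advantage at first token:", actor_adv[0])
print("critic target at first token:  ", critic_targets[0])
```

With a sparse terminal reward, the lambda < 1 estimate at the first token is scaled by lam ** (T - 1), so the terminal reward is almost invisible at the start of a long sequence, while the lambda = 1 estimate carries it back undiminished. That is one plausible reading of "reward signal decay"; the paper itself should be consulted for the actual settings.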