Bibliographic Details
Main Authors:	Yuan, Yufeng; Yue, Yu; Zhu, Ruofei; Fan, Tiantian; Yan, Lin
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2503.01491

_version_	1866913716229898240
author	Yuan, Yufeng; Yue, Yu; Zhu, Ruofei; Fan, Tiantian; Yan, Lin
author_facet	Yuan, Yufeng; Yue, Yu; Zhu, Ruofei; Fan, Tiantian; Yan, Lin
contents	Reinforcement learning (RL) is pivotal for enabling large language models (LLMs) to generate long chains of thought (CoT) for complex tasks like math and reasoning. However, Proximal Policy Optimization (PPO), effective in many RL scenarios, fails in long CoT tasks. This paper identifies that value initialization bias and reward signal decay are the root causes of PPO's failure. We propose Value-Calibrated PPO (VC-PPO) to address these issues. In VC-PPO, the value model is pretrained to tackle initialization bias, and the Generalized Advantage Estimation (GAE) computation is decoupled between the actor and critic to mitigate reward signal decay. Experiments on the American Invitational Mathematics Examination (AIME) show that VC-PPO significantly boosts PPO performance. Ablation studies show that techniques in VC-PPO are essential in enhancing PPO for long CoT tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_01491
institution	arXiv
publishDate	2025
record_format	arxiv
title	What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret
topic	Machine Learning
url	https://arxiv.org/abs/2503.01491
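
The abstract in the contents field names two mechanisms: reward signal decay and a GAE computation that is decoupled between actor and critic. The sketch below is an illustration only, not the authors' implementation. It assumes a single terminal reward, a value model that outputs zero everywhere, gamma = 1, and hypothetical settings lam_actor = 0.95 and lam_critic = 1.0; the function names gae and decoupled_gae are likewise made up for this example.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    values[t] is the value estimate at step t; the value after the
    final step is treated as 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def decoupled_gae(rewards, values, gamma=1.0, lam_actor=0.95, lam_critic=1.0):
    """Use separate lambda values for the policy and the value model.

    Assumption for illustration: the actor keeps a lambda below 1, while
    the critic's regression targets are built with lambda = 1.
    """
    actor_advantages = gae(rewards, values, gamma, lam_actor)
    critic_advantages = gae(rewards, values, gamma, lam_critic)
    value_targets = [a + v for a, v in zip(critic_advantages, values)]
    return actor_advantages, value_targets


if __name__ == "__main__":
    steps = 100                               # stand-in for a long CoT
    rewards = [0.0] * (steps - 1) + [1.0]     # reward only at the last token
    values = [0.0] * steps                    # uninformative value model
    actor_adv, targets = decoupled_gae(rewards, values)
    print("actor advantage at first token:", actor_adv[0])
    print("critic target at first token:", targets[0])
```

Under these assumptions the advantage at the first token shrinks by a factor of lam_actor for every step back from the reward, while the critic target built with lam_critic = 1.0 still carries the full terminal reward. That is one way to picture the decay the abstract refers to.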
