| Main Authors: | Garg, Anisha; Zhang, Claire; Neema, Nishit; Bick, David; Venkatesh, Ganesh; Hestness, Joel |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Artificial Intelligence; Machine Learning |
| Online Access: | https://arxiv.org/abs/2511.04439 |
| _version_ | 1866908866403368960 |
|---|---|
| author | Garg, Anisha; Zhang, Claire; Neema, Nishit; Bick, David; Venkatesh, Ganesh; Hestness, Joel |
| author_facet | Garg, Anisha; Zhang, Claire; Neema, Nishit; Bick, David; Venkatesh, Ganesh; Hestness, Joel |
| contents | Group-Relative Policy Optimization (GRPO) has emerged as the standard for training reasoning capabilities in large language models through reinforcement learning. By estimating advantages using group-mean rewards rather than a learned critic, GRPO has enabled efficient scaling of reinforcement learning from verifiable rewards (RLVR). However, we identify a fundamental limitation: GRPO's mean baseline can assign positive advantages to incorrect solutions simply because they outperform a poorly performing group average. This leads to overestimation of advantages and reinforcement of incorrect behaviours. To address this, we propose Correctness-Relative Policy Optimization (CoRPO), a simple modification to the GRPO objective that clips the minimum baseline to a fixed correctness threshold. We show that baseline clipping introduces a protective bias to advantage estimation that mitigates overfitting while preserving effective exploration. Empirically, CoRPO-trained models improve cross-domain reasoning, generalizing more consistently to out-of-domain (OOD) tasks. When trained on coding tasks, CoRPO outperforms GRPO on math, and vice versa, indicating that CoRPO learns robust, transferable reasoning patterns rather than task-specific solutions. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2511_04439 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | CoRPO: Adding a Correctness Bias to GRPO Improves Generalization |
| topic | Artificial Intelligence; Machine Learning |
| url | https://arxiv.org/abs/2511.04439 |
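
The baseline clipping described in the abstract can be illustrated with a minimal sketch. It assumes that "clipping the minimum baseline" means taking the larger of the group-mean reward and a fixed correctness threshold before subtracting it from each reward. The function names, the reward values, and the 0.5 threshold are hypothetical, and the standard-deviation normalization commonly paired with GRPO is omitted for brevity; the paper's exact objective may differ.

```python
from statistics import fmean


def mean_baseline_advantages(rewards):
    """GRPO-style advantages: each reward minus the group mean.

    Simplified: no division by the group standard deviation.
    """
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


def clipped_baseline_advantages(rewards, correctness_threshold):
    """Assumed CoRPO-style advantages: the baseline never drops below
    the correctness threshold, so a sub-threshold reward cannot receive
    a positive advantage just because the group average is poor.
    """
    baseline = max(fmean(rewards), correctness_threshold)
    return [r - baseline for r in rewards]


# Hypothetical group in which every sample is incorrect; one sample
# earns partial credit (0.25) and so beats the group mean of 0.0625.
rewards = [0.0, 0.0, 0.25, 0.0]
threshold = 0.5  # hypothetical correctness threshold

print(mean_baseline_advantages(rewards))
print(clipped_baseline_advantages(rewards, threshold))
```

With the mean baseline, the partially credited but incorrect sample gets a positive advantage; with the clipped baseline, every advantage in this group is negative. When the group mean already exceeds the threshold, the two functions return the same values.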