Bibliographic Details
Main Authors: Yu, Zhaoning, Su, Will, Tao, Leitian, Wang, Haozhu, Singh, Aashu, Yu, Hanchao, Wang, Jianyu, Gao, Hongyang, Yuan, Weizhe, Weston, Jason, Yu, Ping, Xu, Jing
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2510.02172
author Yu, Zhaoning
Su, Will
Tao, Leitian
Wang, Haozhu
Singh, Aashu
Yu, Hanchao
Wang, Jianyu
Gao, Hongyang
Yuan, Weizhe
Weston, Jason
Yu, Ping
Xu, Jing
contents Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
format Preprint
id arxiv_https___arxiv_org_abs_2510_02172
institution arXiv
publishDate 2025
record_format arxiv
title RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization
topic Computation and Language
url https://arxiv.org/abs/2510.02172