Bibliographic Details
Main Authors: Lin, Mingxiong; Gong, Zhangquan; Tang, Maowen; Li, Qian; Wang, Chuangchuang; Ma, Jian; Huang, Sutian; Tang, Kai; Lu, Haonan
Format: Preprint
Published: 2026
Subjects: Machine Learning; Artificial Intelligence; Computation and Language
Online Access: https://arxiv.org/abs/2605.11403
_version_ 1866917483394367488
author Lin, Mingxiong
Gong, Zhangquan
Tang, Maowen
Li, Qian
Wang, Chuangchuang
Ma, Jian
Huang, Sutian
Tang, Kai
Lu, Haonan
contents Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies in GRPO. First, a fixed KL coefficient over-restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling ignores the fact that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO (Frontier-Guided Exploration-Prioritized Policy Optimization), which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and tightening it when the model performs well. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian centered at a moderate accuracy level of about 0.5, focusing training on the model's learning frontier. We evaluate on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base across six mainstream mathematical reasoning benchmarks. FG-ExPO consistently outperforms vanilla GRPO. It delivers an absolute improvement of 13.34 percentage points in AIME 2025 pass@32 (from 63.33% to 76.67%) and an average pass@32 gain of 2.66 on the 8B model. The substantially larger gains on pass@32 than on pass@1 verify that FG-ExPO enlarges the model's effective exploration space under a fixed inference budget.
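The abstract names the two components but does not give their formulas, so the following is only a minimal sketch of how they might look. Everything specific here is an assumption made for illustration: the sigmoid form of the KL schedule, the names `akl_coefficient`, `gcs_weights`, and `sample_questions`, and the values of `beta_min`, `beta_max`, `steepness`, and `sigma`. Only the Gaussian center of about 0.5 and the direction of the KL adjustment (looser at low accuracy, tighter at high accuracy) come from the record.

```python
import math
import random


def akl_coefficient(batch_accuracy, beta_min=0.001, beta_max=0.04,
                    midpoint=0.5, steepness=10.0):
    """Hypothetical AKL schedule: a sigmoid of batch accuracy.

    Low accuracy gives a KL coefficient near beta_min (looser constraint),
    high accuracy gives one near beta_max (tighter constraint).
    The sigmoid shape and all default values are assumptions.
    """
    gate = 1.0 / (1.0 + math.exp(-steepness * (batch_accuracy - midpoint)))
    return beta_min + (beta_max - beta_min) * gate


def gcs_weights(accuracies, center=0.5, sigma=0.2):
    """Hypothetical GCS weights: Gaussian in per-question accuracy.

    Questions whose running accuracy is close to `center` get the largest
    sampling probability. `sigma` is an assumed width.
    """
    raw = [math.exp(-((a - center) ** 2) / (2.0 * sigma ** 2))
           for a in accuracies]
    total = sum(raw)
    return [w / total for w in raw]


def sample_questions(question_ids, accuracies, k, rng=None):
    """Draw k question ids (with replacement) using the GCS weights."""
    rng = rng or random.Random(0)
    return rng.choices(question_ids, weights=gcs_weights(accuracies), k=k)
```

In a training loop under these assumptions, `akl_coefficient` would be recomputed from each batch's mean reward accuracy before the KL term is added to the loss, and `sample_questions` would replace uniform sampling when the next batch of prompts is drawn.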
format Preprint
id arxiv_https___arxiv_org_abs_2605_11403
institution arXiv
publishDate 2026
record_format arxiv
title FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
topic Machine Learning
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2605.11403