MARC View: Library Catalog

Bibliographic Details
Main Authors:	Lin, Mingxiong; Gong, Zhangquan; Tang, Maowen; Li, Qian; Wang, Chuangchuang; Ma, Jian; Huang, Sutian; Tang, Kai; Lu, Haonan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2605.11403

_version_	1866917483394367488
author	Lin, Mingxiong; Gong, Zhangquan; Tang, Maowen; Li, Qian; Wang, Chuangchuang; Ma, Jian; Huang, Sutian; Tang, Kai; Lu, Haonan
author_facet	Lin, Mingxiong; Gong, Zhangquan; Tang, Maowen; Li, Qian; Wang, Chuangchuang; Ma, Jian; Huang, Sutian; Tang, Kai; Lu, Haonan
contents	Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling overlooks that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO, short for Frontier-Guided Exploration-Prioritized Policy Optimization, which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and strengthening it when the model achieves satisfactory results. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at a moderate accuracy level around 0.5, focusing model training on its learning frontier. We conduct evaluations on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base across six mainstream mathematical reasoning benchmarks. Experimental results demonstrate that FG-ExPO consistently outperforms vanilla GRPO. It delivers an absolute improvement of 13.34 on the AIME 2025 pass@32 metric, rising from 63.33 percent to 76.67 percent, and obtains an average pass@32 gain of 2.66 on the 8B model. The substantially larger performance gains observed on pass@32 compared to pass@1 verify that FG-ExPO enlarges the model's effective exploration space under a fixed inference budget.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_11403
institution	arXiv
publishDate	2026
record_format	arxiv
title	FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
topic	Machine Learning; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2605.11403
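
Illustrative note on the contents field: the abstract describes Accuracy-Conditioned KL Scaling (AKL) and Gaussian Curriculum Sampling (GCS) only in words, and this record does not give their formulas. The Python sketch below is one possible reading, not the authors' implementation. The logistic shape of the KL schedule, the function names, and the parameters beta_min, beta_max, steepness, midpoint and sigma (with their default values) are assumptions made for illustration. Only the direction of the KL adjustment and the Gaussian center near 0.5 come from the abstract.

```python
import math
import random


def akl_coefficient(batch_accuracy, beta_min=0.001, beta_max=0.04,
                    steepness=10.0, midpoint=0.5):
    """Hypothetical AKL schedule: smooth, nonlinear, increasing in accuracy.

    Low batch accuracy gives a KL coefficient near beta_min (looser
    constraint); high batch accuracy gives one near beta_max (tighter).
    The logistic form and all default values are assumptions.
    """
    gate = 1.0 / (1.0 + math.exp(-steepness * (batch_accuracy - midpoint)))
    return beta_min + (beta_max - beta_min) * gate


def gcs_probabilities(accuracies, center=0.5, sigma=0.2):
    """Gaussian sampling probabilities over questions.

    Each question is weighted by a Gaussian of its current accuracy,
    centered at `center`; sigma is an assumed width. Weights are
    normalized so they sum to 1.
    """
    if not accuracies:
        raise ValueError("accuracies must not be empty")
    weights = [math.exp(-((a - center) ** 2) / (2.0 * sigma ** 2))
               for a in accuracies]
    total = sum(weights)
    return [w / total for w in weights]


def sample_questions(accuracies, k, rng=None):
    """Draw k question indices (with replacement) using GCS probabilities."""
    rng = rng or random.Random(0)
    probs = gcs_probabilities(accuracies)
    return rng.choices(range(len(accuracies)), weights=probs, k=k)


if __name__ == "__main__":
    # Per-question accuracies would come from rollouts; these are made up.
    per_question_accuracy = [0.0, 0.25, 0.5, 0.75, 1.0]
    print(gcs_probabilities(per_question_accuracy))
    print(sample_questions(per_question_accuracy, k=8))
    print(akl_coefficient(0.1), akl_coefficient(0.9))
```

Under these assumptions, questions the model solves about half the time are drawn most often, while questions it always or never solves are drawn rarely. The KL coefficient rises smoothly as batch accuracy improves.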
