| Main Author: | Lian, Yongsheng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2512.07611 |
Similar Items
GRPO-$λ$: Credit Assignment improves LLM Reasoning
by: Parthasarathi, Prasanna, et al.
Published: (2025)
Multi-Task GRPO: Reliable LLM Reasoning Across Tasks
by: Ramesh, Shyam Sundhar, et al.
Published: (2026)
Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes
by: Bereket, Michael, et al.
Published: (2025)
Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
by: Zhang, Xichen, et al.
Published: (2025)
GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO
by: Dipta, Shubhashis Roy, et al.
Published: (2026)
Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training
by: Gong, Xue, et al.
Published: (2026)
S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models
by: Dai, Muzhi, et al.
Published: (2025)
AMIR-GRPO: Inducing Implicit Preference Signals into GRPO
by: Yari, Amir Hossein, et al.
Published: (2026)
ExGRPO: Learning to Reason from Experience
by: Zhan, Runzhe, et al.
Published: (2025)
7B Fully Open Source Moxin-LLM/VLM -- From Pretraining to GRPO-based Reinforcement Learning Enhancement
by: Zhao, Pu, et al.
Published: (2024)
From Reasoning to Code: GRPO Optimization for Underrepresented Languages
by: Pennino, Federico, et al.
Published: (2025)
What is the Alignment Objective of GRPO?
by: Vojnovic, Milan, et al.
Published: (2025)
Mode-Dependent Rectification for Stable PPO Training
by: Mohamad, Mohamad, et al.
Published: (2026)
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
by: Wang, Jingyi, et al.
Published: (2026)
Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
by: Chen, Peter, et al.
Published: (2025)
Learning to Tune Pure Pursuit in Autonomous Racing: Joint Lookahead and Steering-Gain Control with PPO
by: Elgouhary, Mohamed, et al.
Published: (2026)
GRPO is Secretly a Process Reward Model
by: Sullivan, Michael, et al.
Published: (2025)
Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
by: Zhu, Mingkang, et al.
Published: (2025)
CIM-PPO:Proximal Policy Optimization with Liu-Correntropy Induced Metric
by: Guo, Yunxiao, et al.
Published: (2021)
PPO-Clip Attains Global Optimality: Towards Deeper Understandings of Clipping
by: Huang, Nai-Chieh, et al.
Published: (2023)
An Approximate Ascent Approach To Prove Convergence of PPO
by: Doering, Leif, et al.
Published: (2026)
SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
by: Zheng, Zhi, et al.
Published: (2025)
ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm
by: Wang, Hanyong, et al.
Published: (2026)
Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO
by: Sun, Jing
Published: (2026)
Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
by: Mansouri, Omar El, et al.
Published: (2025)
A Unified Framework for Rethinking Policy Divergence Measures in GRPO
by: Wu, Qingyuan, et al.
Published: (2026)
Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
by: Ren, Yiming, et al.
Published: (2026)
BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
by: Li, Yuming, et al.
Published: (2025)
BinaryPPO: Efficient Policy Optimization for Binary Classification
by: Pandey, Punya Syon, et al.
Published: (2026)
DPO Meets PPO: Reinforced Token Optimization for RLHF
by: Zhong, Han, et al.
Published: (2024)
IB-GRPO: Aligning LLM-based Learning Path Recommendation with Educational Objectives via Indicator-Based Group Relative Policy Optimization
by: Wang, Shuai, et al.
Published: (2026)
When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift
by: Vogt-Lowell, Kevin, et al.
Published: (2026)
When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic
by: Fernández-Hernández, Alberto, et al.
Published: (2026)
TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing
by: Li, Yuanpeng, et al.
Published: (2026)
CoRPO: Adding a Correctness Bias to GRPO Improves Generalization
by: Garg, Anisha, et al.
Published: (2025)
Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
by: Liu, Zikang, et al.
Published: (2025)
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
by: Chen, Yi, et al.
Published: (2025)
A Robust PPO-optimized Tabular Transformer Framework for Intrusion Detection in Industrial IoT Systems
by: She, Yuanya
Published: (2025)
FPBoost: Fully Parametric Gradient Boosting for Survival Analysis
by: Archetti, Alberto, et al.
Published: (2024)
TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
by: Ding, Zheng, et al.
Published: (2025)