:: Library Catalog

Bibliographic Details
Main Author:	Lian, Yongsheng
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2512.07611

Similar Items

GRPO-$λ$: Credit Assignment improves LLM Reasoning
by: Parthasarathi, Prasanna, et al.
Published: (2025)

Multi-Task GRPO: Reliable LLM Reasoning Across Tasks
by: Ramesh, Shyam Sundhar, et al.
Published: (2026)

Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes
by: Bereket, Michael, et al.
Published: (2025)

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
by: Zhang, Xichen, et al.
Published: (2025)

GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO
by: Dipta, Shubhashis Roy, et al.
Published: (2026)

Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training
by: Gong, Xue, et al.
Published: (2026)

S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models
by: Dai, Muzhi, et al.
Published: (2025)

AMIR-GRPO: Inducing Implicit Preference Signals into GRPO
by: Yari, Amir Hossein, et al.
Published: (2026)

ExGRPO: Learning to Reason from Experience
by: Zhan, Runzhe, et al.
Published: (2025)

7B Fully Open Source Moxin-LLM/VLM -- From Pretraining to GRPO-based Reinforcement Learning Enhancement
by: Zhao, Pu, et al.
Published: (2024)

From Reasoning to Code: GRPO Optimization for Underrepresented Languages
by: Pennino, Federico, et al.
Published: (2025)

What is the Alignment Objective of GRPO?
by: Vojnovic, Milan, et al.
Published: (2025)

Mode-Dependent Rectification for Stable PPO Training
by: Mohamad, Mohamad, et al.
Published: (2026)

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
by: Wang, Jingyi, et al.
Published: (2026)

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
by: Chen, Peter, et al.
Published: (2025)

Learning to Tune Pure Pursuit in Autonomous Racing: Joint Lookahead and Steering-Gain Control with PPO
by: Elgouhary, Mohamed, et al.
Published: (2026)

GRPO is Secretly a Process Reward Model
by: Sullivan, Michael, et al.
Published: (2025)

Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
by: Zhu, Mingkang, et al.
Published: (2025)

CIM-PPO: Proximal Policy Optimization with Liu-Correntropy Induced Metric
by: Guo, Yunxiao, et al.
Published: (2021)

PPO-Clip Attains Global Optimality: Towards Deeper Understandings of Clipping
by: Huang, Nai-Chieh, et al.
Published: (2023)

An Approximate Ascent Approach To Prove Convergence of PPO
by: Doering, Leif, et al.
Published: (2026)

SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
by: Zheng, Zhi, et al.
Published: (2025)

ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm
by: Wang, Hanyong, et al.
Published: (2026)

Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO
by: Sun, Jing
Published: (2026)

Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
by: Mansouri, Omar El, et al.
Published: (2025)

A Unified Framework for Rethinking Policy Divergence Measures in GRPO
by: Wu, Qingyuan, et al.
Published: (2026)

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
by: Ren, Yiming, et al.
Published: (2026)

BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
by: Li, Yuming, et al.
Published: (2025)

BinaryPPO: Efficient Policy Optimization for Binary Classification
by: Pandey, Punya Syon, et al.
Published: (2026)

DPO Meets PPO: Reinforced Token Optimization for RLHF
by: Zhong, Han, et al.
Published: (2024)

IB-GRPO: Aligning LLM-based Learning Path Recommendation with Educational Objectives via Indicator-Based Group Relative Policy Optimization
by: Wang, Shuai, et al.
Published: (2026)

When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift
by: Vogt-Lowell, Kevin, et al.
Published: (2026)

When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic
by: Fernández-Hernández, Alberto, et al.
Published: (2026)

TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing
by: Li, Yuanpeng, et al.
Published: (2026)

CoRPO: Adding a Correctness Bias to GRPO Improves Generalization
by: Garg, Anisha, et al.
Published: (2025)

Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
by: Liu, Zikang, et al.
Published: (2025)

GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
by: Chen, Yi, et al.
Published: (2025)

A Robust PPO-optimized Tabular Transformer Framework for Intrusion Detection in Industrial IoT Systems
by: She, Yuanya
Published: (2025)

FPBoost: Fully Parametric Gradient Boosting for Survival Analysis
by: Archetti, Alberto, et al.
Published: (2024)

TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
by: Ding, Zheng, et al.
Published: (2025)