| Main Authors: | Hoy, William, Wang, Binxu, Pan, Xu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.01499 |
Similar Items
LithoGRPO: Fast Inverse Lithography via GRPO Reinforced Flow Matching
by: Lai, Yao, et al.
Published: (2026)
S-GRPO: Unified Post-Training for Large Vision-Language Models
by: Yan, Yuming, et al.
Published: (2026)
TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
by: Ding, Zheng, et al.
Published: (2025)
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
by: Xu, Yuanda, et al.
Published: (2026)
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
by: Wang, Binxu, et al.
Published: (2026)
Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
by: Wang, Yuanyi, et al.
Published: (2026)
SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
by: Su, Xiaole, et al.
Published: (2026)
Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets
by: Pikus, Benjamin, et al.
Published: (2025)
Graph-GRPO: Training Graph Flow Models with Reinforcement Learning
by: Zhu, Baoheng, et al.
Published: (2026)
MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference
by: Li, Yu, et al.
Published: (2026)
How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
by: Tian, Minghao, et al.
Published: (2026)
Stepwise Credit Assignment for GRPO on Flow-Matching Models
by: Savani, Yash, et al.
Published: (2026)
STABLE: Gated Continual Learning for Large Language Models
by: Hoy, William, et al.
Published: (2025)
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
by: Hu, Zhengding, et al.
Published: (2026)
MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
by: Wei, Kangda, et al.
Published: (2026)
PostTrainBench: Can LLM Agents Automate LLM Post-Training?
by: Rank, Ben, et al.
Published: (2026)
GRPO-$λ$: Credit Assignment improves LLM Reasoning
by: Parthasarathi, Prasanna, et al.
Published: (2025)
Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models
by: Nimmaturi, Datta, et al.
Published: (2025)
GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
by: Wang, Jing, et al.
Published: (2025)
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
by: Hu, Pingbang, et al.
Published: (2026)
Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
by: Shi, Hengyu, et al.
Published: (2026)
AMIR-GRPO: Inducing Implicit Preference Signals into GRPO
by: Yari, Amir Hossein, et al.
Published: (2026)
Accuracy vs. Accuracy: Computational Tradeoffs Between Classification Rates and Utility
by: Amit, Noga, et al.
Published: (2025)
Multi-Task GRPO: Reliable LLM Reasoning Across Tasks
by: Ramesh, Shyam Sundhar, et al.
Published: (2026)
Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them
by: Rajani, Neel, et al.
Published: (2025)
DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning
by: Chen, Xiwen, et al.
Published: (2025)
GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning
by: Xu, Yanchen, et al.
Published: (2025)
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training
by: Du, Xianzhi, et al.
Published: (2024)
Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
by: Liu, Zikang, et al.
Published: (2025)
Prompt Curriculum Learning for Efficient LLM Post-Training
by: Gao, Zhaolin, et al.
Published: (2025)
On the Evolution of Federated Post-Training Large Language Models: A Model Accessibility View
by: Guo, Tao, et al.
Published: (2025)
Consolidating Rewarded Perturbations for LLM Post-Training
by: Zhang, Zheyu, et al.
Published: (2026)
Automatic Configuration of LLM Post-Training Pipelines
by: Chwa, Channe, et al.
Published: (2026)
AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training
by: Han, Zhenyu, et al.
Published: (2025)
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
by: Bergmeister, Andreas, et al.
Published: (2026)
Divergence Minimization Preference Optimization for Diffusion Model Alignment
by: Li, Binxu, et al.
Published: (2025)
f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
by: Haldar, Rajdeep, et al.
Published: (2026)
An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models
by: Wang, Binxu, et al.
Published: (2025)
CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
by: Thede, Lukas, et al.
Published: (2026)
Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement
by: Lian, Yongsheng
Published: (2025)