| Main authors: | Sullivan, Michael; Koller, Alexander |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2509.21154 |
Similar items
Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
by: Mansouri, Omar El, et al.
Published: (2025)
MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
by: Wei, Kangda, et al.
Published: (2026)
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
by: Xu, Yuanda, et al.
Published: (2026)
AMIR-GRPO: Inducing Implicit Preference Signals into GRPO
by: Yari, Amir Hossein, et al.
Published: (2026)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
by: Rafailov, Rafael, et al.
Published: (2023)
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
by: Yao, Chaorui, et al.
Published: (2025)
Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes
by: Bereket, Michael, et al.
Published: (2025)
Adversarial Training for Process Reward Models
by: Juneja, Gurusha, et al.
Published: (2025)
Efficient Process Reward Model Training via Active Learning
by: Duan, Keyu, et al.
Published: (2025)
What is the Alignment Objective of GRPO?
by: Vojnovic, Milan, et al.
Published: (2025)
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
by: Wang, Jingyi, et al.
Published: (2026)
EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance
by: Yu, Song, et al.
Published: (2026)
S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models
by: Dai, Muzhi, et al.
Published: (2025)
Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
by: Ren, Yiming, et al.
Published: (2026)
Process Reward Models That Think
by: Khalifa, Muhammad, et al.
Published: (2025)
BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
by: Li, Yuming, et al.
Published: (2025)
Accelerating Constrained Decoding with Token Space Compression
by: Sullivan, Michael, et al.
Published: (2026)
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
by: Ma, Rachel, et al.
Published: (2026)
Process Reward Models for LLM Agents: Practical Framework and Directions
by: Choudhury, Sanjiban
Published: (2025)
TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
by: Ding, Zheng, et al.
Published: (2025)
GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents
by: Wu, Xiongbin, et al.
Published: (2026)
AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
by: Huang, Runhui, et al.
Published: (2026)
GRPO-$λ$: Credit Assignment improves LLM Reasoning
by: Parthasarathi, Prasanna, et al.
Published: (2025)
CoRPO: Adding a Correctness Bias to GRPO Improves Generalization
by: Garg, Anisha, et al.
Published: (2025)
DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
by: Cao, Qi, et al.
Published: (2025)
Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
by: Groeneveld, Jan Niklas, et al.
Published: (2025)
Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models
by: Peysakhovich, Alexander, et al.
Published: (2026)
Know What You Don't Know: Uncertainty Calibration of Process Reward Models
by: Park, Young-Jin, et al.
Published: (2025)
A Unified Framework for Rethinking Policy Divergence Measures in GRPO
by: Wu, Qingyuan, et al.
Published: (2026)
On the Ability of Transformers to Verify Plans
by: Sarrof, Yash, et al.
Published: (2026)
The Lessons of Developing Process Reward Models in Mathematical Reasoning
by: Zhang, Zhenru, et al.
Published: (2025)
LLM Reasoning with Process Rewards for Outcome-Guided Steps
by: Rezaei, Mohammad, et al.
Published: (2026)
Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
by: Trapasso, Alessandro, et al.
Published: (2025)
Process Rewards with Learned Reliability
by: Li, Jinyuan, et al.
Published: (2026)
Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
by: Liu, Zikang, et al.
Published: (2025)
ADaPT: As-Needed Decomposition and Planning with Language Models
by: Prasad, Archiki, et al.
Published: (2023)
Exploration Through Introspection: A Self-Aware Reward Model
by: Petrowski, Michael, et al.
Published: (2026)
ExGRPO: Learning to Reason from Experience
by: Zhan, Runzhe, et al.
Published: (2025)
Unlocking Multimodal Mathematical Reasoning via Process Reward Model
by: Luo, Ruilin, et al.
Published: (2025)
Efficient Process Reward Modeling via Contrastive Mutual Information
by: Lee, Nakyung, et al.
Published: (2026)