:: Library Catalog

Bibliographic Details
Main Authors:	Sullivan, Michael; Koller, Alexander
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2509.21154

Similar Items

Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
by: Mansouri, Omar El, et al.
Published: (2025)

MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
by: Wei, Kangda, et al.
Published: (2026)

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
by: Xu, Yuanda, et al.
Published: (2026)

AMIR-GRPO: Inducing Implicit Preference Signals into GRPO
by: Yari, Amir Hossein, et al.
Published: (2026)

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
by: Rafailov, Rafael, et al.
Published: (2023)

Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
by: Yao, Chaorui, et al.
Published: (2025)

Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes
by: Bereket, Michael, et al.
Published: (2025)

Adversarial Training for Process Reward Models
by: Juneja, Gurusha, et al.
Published: (2025)

Efficient Process Reward Model Training via Active Learning
by: Duan, Keyu, et al.
Published: (2025)

What is the Alignment Objective of GRPO?
by: Vojnovic, Milan, et al.
Published: (2025)

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
by: Wang, Jingyi, et al.
Published: (2026)

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance
by: Yu, Song, et al.
Published: (2026)

S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models
by: Dai, Muzhi, et al.
Published: (2025)

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
by: Ren, Yiming, et al.
Published: (2026)

Process Reward Models That Think
by: Khalifa, Muhammad, et al.
Published: (2025)

BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
by: Li, Yuming, et al.
Published: (2025)

Accelerating Constrained Decoding with Token Space Compression
by: Sullivan, Michael, et al.
Published: (2026)

Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
by: Ma, Rachel, et al.
Published: (2026)

Process Reward Models for LLM Agents: Practical Framework and Directions
by: Choudhury, Sanjiban
Published: (2025)

TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
by: Ding, Zheng, et al.
Published: (2025)

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents
by: Wu, Xiongbin, et al.
Published: (2026)

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
by: Huang, Runhui, et al.
Published: (2026)

GRPO-$λ$: Credit Assignment improves LLM Reasoning
by: Parthasarathi, Prasanna, et al.
Published: (2025)

CoRPO: Adding a Correctness Bias to GRPO Improves Generalization
by: Garg, Anisha, et al.
Published: (2025)

DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
by: Cao, Qi, et al.
Published: (2025)

Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
by: Groeneveld, Jan Niklas, et al.
Published: (2025)

Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models
by: Peysakhovich, Alexander, et al.
Published: (2026)

Know What You Don't Know: Uncertainty Calibration of Process Reward Models
by: Park, Young-Jin, et al.
Published: (2025)

A Unified Framework for Rethinking Policy Divergence Measures in GRPO
by: Wu, Qingyuan, et al.
Published: (2026)

On the Ability of Transformers to Verify Plans
by: Sarrof, Yash, et al.
Published: (2026)

The Lessons of Developing Process Reward Models in Mathematical Reasoning
by: Zhang, Zhenru, et al.
Published: (2025)

LLM Reasoning with Process Rewards for Outcome-Guided Steps
by: Rezaei, Mohammad, et al.
Published: (2026)

Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
by: Trapasso, Alessandro, et al.
Published: (2025)

Process Rewards with Learned Reliability
by: Li, Jinyuan, et al.
Published: (2026)

Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
by: Liu, Zikang, et al.
Published: (2025)

ADaPT: As-Needed Decomposition and Planning with Language Models
by: Prasad, Archiki, et al.
Published: (2023)

Exploration Through Introspection: A Self-Aware Reward Model
by: Petrowski, Michael, et al.
Published: (2026)

ExGRPO: Learning to Reason from Experience
by: Zhan, Runzhe, et al.
Published: (2025)

Unlocking Multimodal Mathematical Reasoning via Process Reward Model
by: Luo, Ruilin, et al.
Published: (2025)

Efficient Process Reward Modeling via Contrastive Mutual Information
by: Lee, Nakyung, et al.
Published: (2026)