| Main Authors: | Rahman, Salman, Gorantla, Sruthi, Gupta, Arpit, Roy, Swastik, Peng, Nanyun, Liu, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.03244 |
Similar Items
FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline
by: Seegmiller, Parker, et al.
Published: (2025)
Process Reinforcement through Implicit Rewards
by: Cui, Ganqu, et al.
Published: (2025)
Rubric-Guided Process Reward for Stepwise Model Routing
by: Ye, Shenghao, et al.
Published: (2026)
Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning
by: Fei, Wu, et al.
Published: (2025)
Process Rewards with Learned Reliability
by: Li, Jinyuan, et al.
Published: (2026)
Text2Reward: Reward Shaping with Language Models for Reinforcement Learning
by: Xie, Tianbao, et al.
Published: (2023)
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
by: Kunde, Vishnu Teja, et al.
Published: (2026)
StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason
by: Zhang, Kaiyi, et al.
Published: (2025)
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
by: Park, Jungsoo, et al.
Published: (2026)
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
by: Yang, Wenkai, et al.
Published: (2025)
HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning
by: Lu, Zhicong, et al.
Published: (2026)
The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs
by: Bandarkar, Lucas, et al.
Published: (2025)
Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment
by: Cheng, Ruoxi, et al.
Published: (2025)
Mitigating Bias for Question Answering Models by Tracking Bias Influence
by: Ma, Mingyu Derek, et al.
Published: (2023)
Process Reward Models That Think
by: Khalifa, Muhammad, et al.
Published: (2025)
RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment
by: Yang, Kevin, et al.
Published: (2023)
Reinforcement Learning with Conditional Expectation Reward
by: Xiao, Changyi, et al.
Published: (2026)
REFA: Reference Free Alignment for multi-preference optimization
by: Gupta, Taneesh, et al.
Published: (2024)
Multilingual Routing in Mixture-of-Experts
by: Bandarkar, Lucas, et al.
Published: (2025)
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
by: Gunjal, Anisha, et al.
Published: (2025)
ReCode: Reinforcing Code Generation with Reasoning-Process Rewards
by: Fan, Lishui, et al.
Published: (2025)
Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts
by: Martin, Liu O., et al.
Published: (2026)
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
by: Rizvi, Md Imbesat Hassan, et al.
Published: (2025)
Reward Is Enough: LLMs Are In-Context Reinforcement Learners
by: Song, Kefan, et al.
Published: (2025)
SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation
by: Yang, Wenjie, et al.
Published: (2025)
Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning
by: Ye, Zhiling, et al.
Published: (2025)
PoPE: Legendre Orthogonal Polynomials Based Position Encoding for Large Language Models
by: Aggarwal, Arpit
Published: (2024)
PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization
by: Rahman, Ben
Published: (2025)
RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
by: Yang, Daniel, et al.
Published: (2026)
TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks
by: Kapoor, Vansh, et al.
Published: (2026)
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
by: Ackermann, Johannes, et al.
Published: (2026)
Model Extrapolation Expedites Alignment
by: Zheng, Chujie, et al.
Published: (2024)
Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners
by: Peng, Miao, et al.
Published: (2025)
Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
by: Ma, Zhengzhao, et al.
Published: (2026)
Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards
by: Nguyen, Hieu Trung, et al.
Published: (2026)
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
by: Stojanovski, Zafir, et al.
Published: (2025)
CARL: Criticality-Aware Agentic Reinforcement Learning
by: Shen, Leyang, et al.
Published: (2025)
More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
by: Cao, Lang, et al.
Published: (2025)
PhonologyBench: Evaluating Phonological Skills of Large Language Models
by: Suvarna, Ashima, et al.
Published: (2024)
DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning
by: Parekh, Tanmay, et al.
Published: (2025)