| Main Authors: | Wang, Zongsheng, Sun, Kaili, Wu, Bowen, Yu, Qun, Li, Ying, Wang, Baoxun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.10218 |
Similar Items
Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling
by: Feng, Zihao, et al.
Published: (2025)
Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History
by: Wu, Bowen, et al.
Published: (2025)
ToolSample: Dual Dynamic Sampling Methods with Curriculum Learning for RL-based Tool Learning
by: Feng, Zihao, et al.
Published: (2025)
LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward
by: Zhao, Yi, et al.
Published: (2025)
Towards the Holographic Characteristic of LLMs for Efficient Short-text Generation
by: Qian, Shun, et al.
Published: (2026)
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
by: Sun, Xiaohui, et al.
Published: (2025)
Lessons from Training Grounded LLMs with Verifiable Rewards
by: Sim, Shang Hong, et al.
Published: (2025)
GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
by: Tan, Hongze, et al.
Published: (2025)
RM-R1: Reward Modeling as Reasoning
by: Chen, Xiusi, et al.
Published: (2025)
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
by: Wen, Xumeng, et al.
Published: (2025)
bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs
by: Ji, Wence, et al.
Published: (2025)
Logic-Regularized Verifier Elicits Reasoning from LLMs
by: Wang, Xinyu, et al.
Published: (2026)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
by: Liu, Shudong, et al.
Published: (2025)
Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards
by: Zhang, Xin, et al.
Published: (2026)
Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards
by: Lara, Luis, et al.
Published: (2026)
Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
by: Su, Yi, et al.
Published: (2025)
From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation
by: Jiang, Yuxin, et al.
Published: (2026)
Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
by: Zhang, Yimeng, et al.
Published: (2025)
Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards
by: Liu, Xiaoyuan, et al.
Published: (2025)
DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning
by: Chen, Xiwen, et al.
Published: (2025)
LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards
by: Ping, Bowen, et al.
Published: (2026)
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
by: Yao, Huanjin, et al.
Published: (2025)
PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards
by: Ghimire, Mukesh, et al.
Published: (2026)
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
by: Yan, Kai, et al.
Published: (2026)
ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
by: Zhang, Bonan, et al.
Published: (2025)
MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
by: Wei, Kangda, et al.
Published: (2026)
$λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences
by: Wang, Yining, et al.
Published: (2025)
S-GRPO: Unified Post-Training for Large Vision-Language Models
by: Yan, Yuming, et al.
Published: (2026)
Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation
by: Zhou, Jiang, et al.
Published: (2026)
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
by: Bensal, Shelly, et al.
Published: (2025)
Improving Value-based Process Verifier via Structural Prior Injection
by: Sun, Zetian, et al.
Published: (2025)
Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning
by: Tian, Changyuan, et al.
Published: (2025)
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
by: Ren, Mengjie, et al.
Published: (2026)
RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
by: Wang, Peisong, et al.
Published: (2025)
IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards
by: Guo, Xu, et al.
Published: (2025)
Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
by: Pappone, Francesco, et al.
Published: (2025)
Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL with GRPO
by: Kattamuri, Ashish, et al.
Published: (2025)
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
by: Peng, Hao, et al.
Published: (2025)
Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards
by: Liu, Shuze Daniel, et al.
Published: (2026)
CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization
by: Ye, Xinge, et al.
Published: (2025)