| Main Authors: | Feng, Zihao, Wang, Xiaoxue, Bai, Ziwei, Su, Donghang, Wu, Bowen, Yu, Qun, Wang, Baoxun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.13592 |
Similar Items
ToolSample: Dual Dynamic Sampling Methods with Curriculum Learning for RL-based Tool Learning
by: Feng, Zihao, et al.
Published: (2025)
RAIDEN-R1: Improving Role-awareness of LLMs via GRPO with Verifiable Reward
by: Wang, Zongsheng, et al.
Published: (2025)
Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History
by: Wu, Bowen, et al.
Published: (2025)
LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward
by: Zhao, Yi, et al.
Published: (2025)
GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection
by: Zhang, Shuguang, et al.
Published: (2026)
GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
by: Tan, Hongze, et al.
Published: (2025)
Towards the Holographic Characteristic of LLMs for Efficient Short-text Generation
by: Qian, Shun, et al.
Published: (2026)
Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
by: Jiang, Shuyang, et al.
Published: (2025)
Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation
by: Yang, Zixuan, et al.
Published: (2026)
MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
by: Wei, Kangda, et al.
Published: (2026)
$λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences
by: Wang, Yining, et al.
Published: (2025)
S-GRPO: Unified Post-Training for Large Vision-Language Models
by: Yan, Yuming, et al.
Published: (2026)
GRPO with State Mutations: Improving LLM-Based Hardware Test Plan Generation
by: Kochar, Dimple Vijay, et al.
Published: (2026)
The Bidirectional Process Reward Model
by: Zhang, Lingyin, et al.
Published: (2025)
Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning
by: Deng, Jingcheng, et al.
Published: (2026)
DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning
by: Chen, Xiwen, et al.
Published: (2025)
Learning to Explain: Prototype-Based Surrogate Models for LLM Classification
by: Wei, Bowen, et al.
Published: (2025)
Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
by: Pappone, Francesco, et al.
Published: (2025)
Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL with GRPO
by: Kattamuri, Ashish, et al.
Published: (2025)
Can GRPO Boost Complex Multimodal Table Understanding?
by: Kang, Xiaoqiang, et al.
Published: (2025)
SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
by: Tang, Yixuan, et al.
Published: (2025)
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
by: Sun, Xiaohui, et al.
Published: (2025)
It Takes Two: Your GRPO Is Secretly DPO
by: Wu, Yihong, et al.
Published: (2025)
MT-RewardTree: A Comprehensive Framework for Advancing LLM-Based Machine Translation via Reward Modeling
by: Feng, Zhaopeng, et al.
Published: (2025)
Libra: Assessing and Improving Reward Model by Learning to Think
by: Zhou, Meng, et al.
Published: (2025)
Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents
by: Wang, Ziyi, et al.
Published: (2026)
Intent-Driven Semantic ID Generation for Grounded Conversational News Recommendation
by: Su, Hongyang, et al.
Published: (2026)
Bridging Thoughts and Words: Graph-Based Intent-Semantic Joint Learning for Fake News Detection
by: Wang, Zhengjia, et al.
Published: (2025)
GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO
by: Dipta, Shubhashis Roy, et al.
Published: (2026)
ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding
by: Sun, Zhongxiang, et al.
Published: (2025)
Detecting Conversational Mental Manipulation with Intent-Aware Prompting
by: Ma, Jiayuan, et al.
Published: (2024)
Towards Understanding the Influence of Reward Margin on Preference Model Performance
by: Qin, Bowen, et al.
Published: (2024)
Multi-Reward GRPO Fine-Tuning for De-biasing Large Language Models: A Study Based on Chinese-Context Discrimination Data
by: Yixuan, Deng, et al.
Published: (2025)
Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement
by: Cheng, Zihao, et al.
Published: (2024)
Advancing Interpretability in Text Classification through Prototype Learning
by: Wei, Bowen, et al.
Published: (2024)
M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization
by: Bai, Bizhe, et al.
Published: (2025)
Beyond the Known: Investigating LLMs Performance on Out-of-Domain Intent Detection
by: Wang, Pei, et al.
Published: (2024)
Generate then Refine: Data Augmentation for Zero-shot Intent Detection
by: Lin, I-Fan, et al.
Published: (2024)
Intent-driven In-context Learning for Few-shot Dialogue State Tracking
by: Yi, Zihao, et al.
Published: (2024)
Intent Detection in the Age of LLMs
by: Arora, Gaurav, et al.
Published: (2024)