| Main Authors: | Xie, Yutao, Thomas, Nathaniel, Hansen, Nicklas, Fu, Yang, Li, Li Erran, Wang, Xiaolong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.22293 |
Similar Items
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
by: Wu, Fang, et al.
Published: (2025)
Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)
Reward-Robust RLHF in LLMs
by: Yan, Yuzi, et al.
Published: (2024)
Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning
by: Li, Junsong, et al.
Published: (2025)
Text2Reward: Reward Shaping with Language Models for Reinforcement Learning
by: Xie, Tianbao, et al.
Published: (2023)
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
by: Liang, Zihan, et al.
Published: (2026)
ARGS: Alignment as Reward-Guided Search
by: Khanov, Maxim, et al.
Published: (2024)
Selective Preference Optimization via Token-Level Reward Function Estimation
by: Yang, Kailai, et al.
Published: (2024)
Multi-Turn Code Generation Through Single-Step Rewards
by: Jain, Arnav Kumar, et al.
Published: (2025)
KnowCoder-X: Boosting Multilingual Information Extraction via Code
by: Zuo, Yuxin, et al.
Published: (2024)
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
by: Li, Manling, et al.
Published: (2024)
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
by: Liu, Yang, et al.
Published: (2026)
DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs
by: Cattan, Arie, et al.
Published: (2025)
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
by: Wu, Yuning, et al.
Published: (2026)
Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners
by: Peng, Miao, et al.
Published: (2025)
LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization
by: Li, Junsong, et al.
Published: (2025)
Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective
by: He, Shenghua, et al.
Published: (2025)
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
by: Wang, Guoqing, et al.
Published: (2025)
On the Shape of Brainscores for Large Language Models (LLMs)
by: Li, Jingkai
Published: (2024)
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
by: Ou, Jingyang, et al.
Published: (2025)
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
by: Djuhera, Aladin, et al.
Published: (2026)
Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
by: Liu, Wei, et al.
Published: (2025)
Reward Is Enough: LLMs Are In-Context Reinforcement Learners
by: Song, Kefan, et al.
Published: (2025)
Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
by: Ling, Zhenqing, et al.
Published: (2025)
T-REG: Preference Optimization with Token-Level Reward Regularization
by: Zhou, Wenxuan, et al.
Published: (2024)
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
by: Yang, Wenkai, et al.
Published: (2026)
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
by: Li, Zhuowan, et al.
Published: (2024)
RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
by: Liu, Wanlong, et al.
Published: (2024)
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
by: Ye, Guanghao, et al.
Published: (2025)
DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue
by: Zhang, Feiyuan, et al.
Published: (2025)
Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation
by: Dong, Guanting, et al.
Published: (2024)
GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
by: Tan, Hongze, et al.
Published: (2025)
Language Models Can Reduce Asymmetry in Information Markets
by: Rahaman, Nasim, et al.
Published: (2024)
AGR: Age Group fairness Reward for Bias Mitigation in LLMs
by: Cao, Shuirong, et al.
Published: (2024)
Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines
by: Jørgensen, Mikkel Godsk, et al.
Published: (2026)
Process Rewards with Learned Reliability
by: Li, Jinyuan, et al.
Published: (2026)
Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization
by: Li, Chenliang, et al.
Published: (2025)
TD-MPC2: Scalable, Robust World Models for Continuous Control
by: Hansen, Nicklas, et al.
Published: (2023)
Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
by: Merth, Thomas, et al.
Published: (2024)
Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction
by: Xu, Jun, et al.
Published: (2025)