:: Library Catalog

Bibliographic Details
Main Authors:	Xie, Yutao; Thomas, Nathaniel; Hansen, Nicklas; Fu, Yang; Li, Li Erran; Wang, Xiaolong
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2603.22293

Similar Items

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
by: Wu, Fang, et al.
Published: (2025)

Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)

Reward-Robust RLHF in LLMs
by: Yan, Yuzi, et al.
Published: (2024)

Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning
by: Li, Junsong, et al.
Published: (2025)

Text2Reward: Reward Shaping with Language Models for Reinforcement Learning
by: Xie, Tianbao, et al.
Published: (2023)

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
by: Liang, Zihan, et al.
Published: (2026)

ARGS: Alignment as Reward-Guided Search
by: Khanov, Maxim, et al.
Published: (2024)

Selective Preference Optimization via Token-Level Reward Function Estimation
by: Yang, Kailai, et al.
Published: (2024)

Multi-Turn Code Generation Through Single-Step Rewards
by: Jain, Arnav Kumar, et al.
Published: (2025)

KnowCoder-X: Boosting Multilingual Information Extraction via Code
by: Zuo, Yuxin, et al.
Published: (2024)

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
by: Li, Manling, et al.
Published: (2024)

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
by: Liu, Yang, et al.
Published: (2026)

DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs
by: Cattan, Arie, et al.
Published: (2025)

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
by: Wu, Yuning, et al.
Published: (2026)

Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners
by: Peng, Miao, et al.
Published: (2025)

LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization
by: Li, Junsong, et al.
Published: (2025)

Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective
by: He, Shenghua, et al.
Published: (2025)

Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
by: Wang, Guoqing, et al.
Published: (2025)

On the Shape of Brainscores for Large Language Models (LLMs)
by: Li, Jingkai
Published: (2024)

Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
by: Ou, Jingyang, et al.
Published: (2025)

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
by: Djuhera, Aladin, et al.
Published: (2026)

Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
by: Liu, Wei, et al.
Published: (2025)

Reward Is Enough: LLMs Are In-Context Reinforcement Learners
by: Song, Kefan, et al.
Published: (2025)

Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
by: Ling, Zhenqing, et al.
Published: (2025)

T-REG: Preference Optimization with Token-Level Reward Regularization
by: Zhou, Wenxuan, et al.
Published: (2024)

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
by: Yang, Wenkai, et al.
Published: (2026)

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
by: Li, Zhuowan, et al.
Published: (2024)

RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
by: Liu, Wanlong, et al.
Published: (2024)

On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
by: Ye, Guanghao, et al.
Published: (2025)

DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue
by: Zhang, Feiyuan, et al.
Published: (2025)

Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation
by: Dong, Guanting, et al.
Published: (2024)

GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
by: Tan, Hongze, et al.
Published: (2025)

Language Models Can Reduce Asymmetry in Information Markets
by: Rahaman, Nasim, et al.
Published: (2024)

AGR: Age Group fairness Reward for Bias Mitigation in LLMs
by: Cao, Shuirong, et al.
Published: (2024)

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines
by: Jørgensen, Mikkel Godsk, et al.
Published: (2026)

Process Rewards with Learned Reliability
by: Li, Jinyuan, et al.
Published: (2026)

Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization
by: Li, Chenliang, et al.
Published: (2025)

TD-MPC2: Scalable, Robust World Models for Continuous Control
by: Hansen, Nicklas, et al.
Published: (2023)

Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
by: Merth, Thomas, et al.
Published: (2024)

Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction
by: Xu, Jun, et al.
Published: (2025)