Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Xie, Yutao; Thomas, Nathaniel; Hansen, Nicklas; Fu, Yang; Li, Li Erran; Wang, Xiaolong
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2603.22293

_version_	1866914414727266304
author	Xie, Yutao; Thomas, Nathaniel; Hansen, Nicklas; Fu, Yang; Li, Li Erran; Wang, Xiaolong
author_facet	Xie, Yutao; Thomas, Nathaniel; Hansen, Nicklas; Fu, Yang; Li, Li Erran; Wang, Xiaolong
contents	Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training remains a significant challenge. Optimization is often unstable because rewards are sparse and credit assignment across reasoning steps and tool calls is difficult. To address this, we introduce Turn-Level Information-Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increase in the likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained, policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_22293
institution	arXiv
publishDate	2026
record_format	arxiv
title	TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs
topic	Computation and Language; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2603.22293
