Bibliographic Details
Main Authors:	Liu, Yuchen; Feng, Yingjie; Qin, Lixiong; Chen, Jiasi; Yu, Jianing; Gao, Sheng; Yang, Sheng; Xu, Weiran
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.29697

Abstract:

In Agentic Search, trajectory-level outcome rewards fail to quantify the behavioral contributions of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each information-seeking (IS) task as search within a latent task graph, where effective steps should make graph progress toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly retrieved and newly cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.
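
The idea of rewarding steps by graph distance can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the scoring rule 1 / (1 + hops), the use of the best new entity per step, the mean baseline, and the `weight` parameter are all assumptions made for illustration, as are the function names and the toy graph.

```python
from collections import deque


def distances_to_answer(graph, answer):
    """Breadth-first search from the answer node; returns hop counts per entity."""
    dist = {answer: 0}
    queue = deque([answer])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def step_rewards(steps, dist):
    """Score each step by the entities it introduces for the first time."""
    seen = set()
    rewards = []
    for entities in steps:
        new = [e for e in entities if e not in seen]
        seen.update(new)
        # Assumed scoring rule: 1 / (1 + hops). Entities that are not
        # connected to the answer node are ignored.
        scores = [1.0 / (1 + dist[e]) for e in new if e in dist]
        rewards.append(max(scores, default=0.0))
    return rewards


def combined_advantages(rewards, outcome_advantage, weight=0.5):
    """Add a mean-centered step term to a trajectory-level advantage (assumed form)."""
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    return [outcome_advantage + weight * (r - mean) for r in rewards]


# Toy undirected entity graph: A - B - C - ANS (hypothetical data).
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "ANS"], "ANS": ["C"]}
dist = distances_to_answer(graph, "ANS")

# Entities retrieved or cited at each step; the last step adds nothing new.
steps = [["A"], ["A", "B"], ["C"], ["C"]]
rewards = step_rewards(steps, dist)
advantages = combined_advantages(rewards, outcome_advantage=1.0)
```

In this toy run, steps that reach entities closer to the answer node receive larger rewards, and a step that only repeats known entities receives zero, so it ends up with the lowest combined advantage.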
