Bibliographic Details
Main Authors: Wu, Huyu; Liu, Jun; Wei, Xiaochi; Gao, Yan; Wu, Yi; Hu, Yao
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.05702
author Wu, Huyu
Liu, Jun
Wei, Xiaochi
Gao, Yan
Wu, Yi
Hu, Yao
contents Self-evolving search agents reduce reliance on human-written training questions by generating and solving their own search tasks. We build on Search Self-Play (SSP), a representative Proposer and Solver framework in which questions are generated and answered via multi-step search and reasoning. In practice, however, SSP faces two bottlenecks: the Proposer constructs questions from isolated answer entities without relational context, yielding many invalid or unverifiable questions in early self-play training, while the Solver receives only a binary outcome reward that discards useful signal from partially on-track search trajectories. We address both bottlenecks by reusing knowledge-graph paths as construction-derived intermediate supervision for both question construction and reward shaping. First, we ground question construction in LLM-guided knowledge-graph subgraphs, providing relational context for the Proposer. Second, we observe that constructing and solving a multi-hop question can involve overlapping intermediate entities: the factual bridges used to formulate the question may provide approximate waypoints for answering it. Exploiting this overlap, we introduce Waypoint Coverage Reward (WCR), which grants graded partial credit to incorrect Solver trajectories according to their coverage of entities on the construction path, while preserving full reward for correct answers. Across seven QA benchmarks and nine model configurations, our approach improves the average score over standard SSP in all configurations, including notable gains on multi-hop QA tasks. These results suggest that knowledge-graph paths can be reused as lightweight intermediate supervision, providing both relational guidance and process feedback without additional task-specific human annotations or manually labeled process steps.
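The abstract describes the Waypoint Coverage Reward (WCR) only in words: correct answers keep full reward, and incorrect trajectories earn graded partial credit according to how many construction-path entities they cover. The sketch below is one possible reading of that description, not the authors' formula. The function name, the `partial_scale` factor, and the case-insensitive substring matching used to decide whether a waypoint is "covered" are all assumptions made for illustration.

```python
from typing import Iterable, Sequence


def waypoint_coverage_reward(
    answer_correct: bool,
    trajectory_texts: Iterable[str],
    waypoints: Sequence[str],
    partial_scale: float = 0.5,  # hypothetical cap on partial credit
) -> float:
    """Illustrative WCR-style reward (assumed form, not from the paper).

    A correct answer keeps the full reward of 1.0. An incorrect answer
    earns partial credit proportional to the fraction of construction-path
    entities (waypoints) mentioned anywhere in the Solver's trajectory.
    """
    if answer_correct:
        return 1.0

    # Normalise waypoints and drop blanks and duplicates.
    unique = {w.strip().lower() for w in waypoints if w.strip()}
    if not unique:
        return 0.0

    # Assumption: a waypoint counts as covered if its name appears as a
    # substring of any query or retrieved snippet in the trajectory.
    corpus = " ".join(trajectory_texts).lower()
    covered = sum(1 for w in unique if w in corpus)

    return partial_scale * covered / len(unique)
```

Keeping `partial_scale` strictly below 1.0 means an incorrect trajectory can never score as high as a correct one, which matches the abstract's statement that full reward is preserved for correct answers.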
format Preprint
id arxiv_https___arxiv_org_abs_2605_05702
institution arXiv
publishDate 2026
record_format arxiv
title Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents
topic Artificial Intelligence
url https://arxiv.org/abs/2605.05702