Bibliographic Details
Main Authors: Han, Hojae, Jung, Heeyun, Kim, Jongyoon, Hwang, Seung-won
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2601.21699
author Han, Hojae
Jung, Heeyun
Kim, Jongyoon
Hwang, Seung-won
contents Multi-turn reasoning agents solve complex questions by decomposing them into intermediate retrieval or tool-use steps and accumulating supporting evidence across turns. Training these agents with reinforcement learning (RL), however, relies on many on-policy rollouts and large training batches. Under realistic resource constraints that make dense exploration infeasible, each RL batch contains only a few useful reasoning paths from the current policy. Existing approaches do not fully address this bottleneck: SFT-based initialization can overfit when annotated trajectories are scarce, retrieval-level rewards can assign credit to individual retrieved documents without directly optimizing coverage of the full evidence set, and expansion can waste rollouts on poorly chosen prefixes. We introduce David-GRPO, which improves small-batch learning by using information from both outside and inside the current policy: (i) expert bootstrapping injects a few off-policy expert trajectories into RL updates, and (ii) evidence-guided exploration turns on-policy partial successes into evidence-coverage scores and additional continuations. On agents of up to 1.5B parameters trained on four RTX 3090 GPUs, David-GRPO improves over prior RL baselines under the same low-budget setting on six multi-hop QA benchmarks. The gains come with a behavioral shift: unlike prior low-budget RL baselines, which often skip retrieval or stop after shallow search, David-GRPO learns to increase retrieval depth and evidence coverage.
format Preprint
id arxiv_https___arxiv_org_abs_2601_21699
institution arXiv
publishDate 2026
record_format arxiv
title Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
topic Computation and Language
url https://arxiv.org/abs/2601.21699