MARC View :: Library Catalog


Bibliographic Details
Main Authors:	Zhang, Junshuo; Huang, Chengrui; Guo, Feng; Li, Zihan; Shi, Ke; Jiang, Menghua; Yu, Jiguo; Shang, Shuo; Gao, Shen
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.24320

_version_	1866910169363906560
author	Zhang, Junshuo; Huang, Chengrui; Guo, Feng; Li, Zihan; Shi, Ke; Jiang, Menghua; Yu, Jiguo; Shang, Shuo; Gao, Shen
author_facet	Zhang, Junshuo; Huang, Chengrui; Guo, Feng; Li, Zihan; Shi, Ke; Jiang, Menghua; Yu, Jiguo; Shang, Shuo; Gao, Shen
contents	Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: initial supervised fine-tuning (SFT) imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards: Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates, while maintaining efficiency comparable to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_24320
institution	arXiv
publishDate	2026
record_format	arxiv
title	DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents
topic	Computation and Language
url	https://arxiv.org/abs/2604.24320
