Bibliographic Details
Main Authors: Guo, Yuyu; Yang, Wenjie; Yang, Siyuan; Liu, Ziyang; Chen, Cheng; Wei, Yuan; Hu, Yun; Huang, Yang; Hao, Guoliang; Yuan, Dongsheng; Wang, Jianming; Chen, Xin; Yu, Hang; Lei, Lei; Di, Peng
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.13559
_version_ 1866910178871345152
author Guo, Yuyu
Yang, Wenjie
Yang, Siyuan
Liu, Ziyang
Chen, Cheng
Wei, Yuan
Hu, Yun
Huang, Yang
Hao, Guoliang
Yuan, Dongsheng
Wang, Jianming
Chen, Xin
Yu, Hang
Lei, Lei
Di, Peng
contents To fulfill user instructions, autonomous web agents must contend with the inherent complexity and volatility of real-world websites. Conventional approaches rely predominantly on Supervised Fine-Tuning (SFT) or offline Reinforcement Learning (RL) over static datasets. These methods suffer from severe distribution shift, because offline trajectories fail to capture the stochastic state transitions and real-time feedback of the open, unconstrained web. In this paper, we propose a robust online RL web agent that optimizes its policy through direct, iterative interaction with unconstrained real-world websites. Our approach comprises three core innovations: 1) Hierarchical Multi-Task Fine-tuning: We curate a comprehensive mixture of datasets categorized by functional primitives (Planning, Acting, and Grounding), establishing a Vision-Language Model (VLM) with strong instruction-following capabilities for Web GUI tasks. 2) Online Agentic RL in the Wild: We develop an online interaction environment and fine-tune the VLM using a specialized RL pipeline. We introduce a Hybrid Reward Mechanism that combines a ground-truth-agnostic WebJudge for holistic outcome assessment with a Rule-based Decision Tree (RDT) for progress reward. This system effectively mitigates the credit assignment challenge in long-horizon navigation. Notably, our RL-enhanced model achieves a 38.1% success rate (pass@5) on WebArena, outperforming all existing monolithic baselines. 3) Operator Agent: We introduce a modular agentic framework, OpAgent, that orchestrates a Planner, Grounder, Reflector, and Summarizer. This synergy enables robust error recovery and self-correction, raising the agent's performance to a new State-of-the-Art (SOTA) success rate of 71.6%.
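The Hybrid Reward Mechanism named in the abstract pairs a sparse outcome signal (the WebJudge verdict) with a dense progress signal (the RDT). The abstract does not say how the two are combined, so the following is only a minimal sketch under stated assumptions: the `Step` record, the `hybrid_rewards` function, the per-step progress scores in [0, 1], and both weights are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent step. `progress` is a hypothetical rule-based score in [0, 1]."""
    action: str
    progress: float


def hybrid_rewards(steps: List[Step], outcome_success: bool,
                   outcome_weight: float = 1.0,
                   progress_weight: float = 0.1) -> List[float]:
    """Return one reward per step.

    Every step receives a small dense reward from its progress score; the
    final step additionally receives the sparse outcome reward derived from
    the judge verdict. Both weights are illustrative assumptions.
    """
    rewards = [progress_weight * s.progress for s in steps]
    if steps:
        rewards[-1] += outcome_weight * (1.0 if outcome_success else 0.0)
    return rewards


# Toy trajectory with made-up progress scores.
trajectory = [
    Step("click search box", 0.5),
    Step("type query", 1.0),
    Step("submit", 0.0),
]
rewards = hybrid_rewards(trajectory, outcome_success=True)
```

Spreading a small reward over intermediate steps is one common way to ease credit assignment in long trajectories, since a purely terminal reward gives early actions no direct signal. Whether the paper's pipeline weights or shapes the terms this way is not stated in the record.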
format Preprint
id arxiv_https___arxiv_org_abs_2602_13559
institution arXiv
publishDate 2026
record_format arxiv
title OpAgent: Operator Agent for Web Navigation
topic Artificial Intelligence
url https://arxiv.org/abs/2602.13559