Staff View: GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning :: Library Catalog

Bibliographic Details
Main Authors:	Yan, Haolong; Shen, Yeqing; Huang, Xin; Wang, Jia; Tan, Kaijun; Liang, Zhixuan; Li, Hongxin; Ge, Zheng; Yoshie, Osamu; Li, Si; Zhang, Xiangyu; Jiang, Daxin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.02423

_version_	1866908687074852864
author	Yan, Haolong; Shen, Yeqing; Huang, Xin; Wang, Jia; Tan, Kaijun; Liang, Zhixuan; Li, Hongxin; Ge, Zheng; Yoshie, Osamu; Li, Si; Zhang, Xiangyu; Jiang, Daxin
author_facet	Yan, Haolong; Shen, Yeqing; Huang, Xin; Wang, Jia; Tan, Kaijun; Liang, Zhixuan; Li, Hongxin; Ge, Zheng; Yoshie, Osamu; Li, Si; Zhang, Xiangyu; Jiang, Daxin
contents	With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_02423
institution	arXiv
publishDate	2025
record_format	arxiv
title	GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2512.02423

Similar Items