Staff View: OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

Bibliographic Details
Main Authors:	Zhang, Le; Xiao, Yixiong; Lu, Xinjiang; Cao, Jingjia; Zhao, Yusai; Zhou, Jingbo; An, Lang; Feng, Zikan; Sha, Wanxiang; Shi, Yu; Xiao, Congxi; Xiong, Jian; Zhang, Yankai; Wu, Hua; Wang, Haifeng
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.20380

_version_	1866917227174821888
author	Zhang, Le; Xiao, Yixiong; Lu, Xinjiang; Cao, Jingjia; Zhao, Yusai; Zhou, Jingbo; An, Lang; Feng, Zikan; Sha, Wanxiang; Shi, Yu; Xiao, Congxi; Xiong, Jian; Zhang, Yankai; Wu, Hua; Wang, Haifeng
author_facet	Zhang, Le; Xiao, Yixiong; Lu, Xinjiang; Cao, Jingjia; Zhao, Yusai; Zhou, Jingbo; An, Lang; Feng, Zikan; Sha, Wanxiang; Shi, Yu; Xiao, Congxi; Xiong, Jian; Zhang, Yankai; Wu, Hua; Wang, Haifeng
contents	Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to better leverage these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_20380
institution	arXiv
publishDate	2026
record_format	arxiv
title	OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution
topic	Artificial Intelligence
url	https://arxiv.org/abs/2601.20380

Similar Items