Bibliographic Details
Main Authors: Zhang, Le; Xiao, Yixiong; Lu, Xinjiang; Cao, Jingjia; Zhao, Yusai; Zhou, Jingbo; An, Lang; Feng, Zikan; Sha, Wanxiang; Shi, Yu; Xiao, Congxi; Xiong, Jian; Zhang, Yankai; Wu, Hua; Wang, Haifeng
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.20380
author Zhang, Le
Xiao, Yixiong
Lu, Xinjiang
Cao, Jingjia
Zhao, Yusai
Zhou, Jingbo
An, Lang
Feng, Zikan
Sha, Wanxiang
Shi, Yu
Xiao, Congxi
Xiong, Jian
Zhang, Yankai
Wu, Hua
Wang, Haifeng
contents Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to better leverage these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.
format Preprint
id arxiv_https___arxiv_org_abs_2601_20380
institution arXiv
publishDate 2026
record_format arxiv
title OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution
topic Artificial Intelligence
url https://arxiv.org/abs/2601.20380