Staff View: OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents :: Library Catalog

Bibliographic Details
Main Authors:	Davydova, Mariya; Jeffries, Daniel; Barker, Patrick; Flores, Arturo Márquez; Ryan, Sinéad
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.03570

_version_	1866909602665201664
author	Davydova, Mariya Jeffries, Daniel Barker, Patrick Flores, Arturo Márquez Ryan, Sinéad
author_facet	Davydova, Mariya Jeffries, Daniel Barker, Patrick Flores, Arturo Márquez Ryan, Sinéad
contents	In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks in increasing levels of complexity, from basic precision clicking to multistep, multiapplication tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the benchmark test cases to ensure that the SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white collar worker can perform all these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate less than 2%. Therefore, this benchmark presents solid ground for fully automated measuring of progress, capabilities and the effectiveness of GUI-navigation AI agents over the short and medium-term horizon. The source code of the benchmark is available at https://github.com/agentsea/osuniverse.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_03570
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents Davydova, Mariya Jeffries, Daniel Barker, Patrick Flores, Arturo Márquez Ryan, Sinéad Artificial Intelligence In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks in increasing levels of complexity, from basic precision clicking to multistep, multiapplication tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the benchmark test cases to ensure that the SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white collar worker can perform all these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate less than 2%. Therefore, this benchmark presents solid ground for fully automated measuring of progress, capabilities and the effectiveness of GUI-navigation AI agents over the short and medium-term horizon. The source code of the benchmark is available at https://github.com/agentsea/osuniverse.
title	OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
topic	Artificial Intelligence
url	https://arxiv.org/abs/2505.03570

Similar Items