Staff View: OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Zheng; Hua, Yi; Huang, Zhaoyuan; Xue, Chenhao; Lu, Yijie; Cheng, Pengzhou; Wu, Zongru; Dong, Lingzhong; Liu, Gongshen; Jiang, Xinghao; Zhang, Zhuosheng
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.24348

_version_	1866911626127474688
author	Wu, Zheng; Hua, Yi; Huang, Zhaoyuan; Xue, Chenhao; Lu, Yijie; Cheng, Pengzhou; Wu, Zongru; Dong, Lingzhong; Liu, Gongshen; Jiang, Xinghao; Zhang, Zhuosheng
author_facet	Wu, Zheng; Hua, Yi; Huang, Zhaoyuan; Xue, Chenhao; Lu, Yijie; Cheng, Pengzhou; Wu, Zongru; Dong, Lingzhong; Liu, Gongshen; Jiang, Xinghao; Zhang, Zhuosheng
contents	The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) an S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) an R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and code are available at https://github.com/Wuzheng02/OS-SPEAR.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_24348
institution	arXiv
publishDate	2026
record_format	arxiv
title	OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
topic	Computation and Language
url	https://arxiv.org/abs/2604.24348