Bibliographic Details
Main Authors: Wu, Zheng; Hua, Yi; Huang, Zhaoyuan; Xue, Chenhao; Lu, Yijie; Cheng, Pengzhou; Wu, Zongru; Dong, Lingzhong; Liu, Gongshen; Jiang, Xinghao; Zhang, Zhuosheng
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2604.24348
author Wu, Zheng
Hua, Yi
Huang, Zhaoyuan
Xue, Chenhao
Lu, Yijie
Cheng, Pengzhou
Wu, Zongru
Dong, Lingzhong
Liu, Gongshen
Jiang, Xinghao
Zhang, Zhuosheng
contents The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multimodal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) an S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) an R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and code are available at https://github.com/Wuzheng02/OS-SPEAR.
format Preprint
id arxiv_https___arxiv_org_abs_2604_24348
institution arXiv
publishDate 2026
record_format arxiv
title OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
topic Computation and Language
url https://arxiv.org/abs/2604.24348