Staff View :: Library Catalog

Bibliographic details
Main Authors:	Luo, Weidi; Zhang, Qiming; Lu, Tianyu; Liu, Xiaogeng; Hu, Bin; Chiu, Hung-Chun; Ma, Siyuan; Zhang, Yizhe; Xiao, Xusheng; Cao, Yinzhi; Xiang, Zhen; Xiao, Chaowei
Format:	Preprint
Published:	2025
Subjects:	Cryptography and Security
Online access:	https://arxiv.org/abs/2510.06607

_version_	1866914085227986944
author	Luo, Weidi; Zhang, Qiming; Lu, Tianyu; Liu, Xiaogeng; Hu, Bin; Chiu, Hung-Chun; Ma, Siyuan; Zhang, Yizhe; Xiao, Xusheng; Cao, Yinzhi; Xiang, Zhen; Xiao, Chaowei
author_facet	Luo, Weidi; Zhang, Qiming; Lu, Tianyu; Liu, Xiaogeng; Hu, Bin; Chiu, Hung-Chun; Ma, Siyuan; Zhang, Yizhe; Xiao, Xusheng; Cao, Yinzhi; Xiang, Zhen; Xiao, Chaowei
contents	Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments without multiple hosts and encrypted user credentials; and unreliable judgment that depends on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. It comprises 140 tasks (40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains) and systematically evaluates CUAs under realistic enterprise OS security threats in a multi-host sandbox environment using hard-coded evaluation. We evaluate five existing mainstream CUAs, including ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE, based on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities of CUAs reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises societal concern about the responsibility and security of CUAs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_06607
institution	arXiv
publishDate	2025
record_format	arxiv
title	Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent
topic	Cryptography and Security
url	https://arxiv.org/abs/2510.06607