Staff View: Understanding the Weakness of Large Language Model Agents within a Complex Android Environment :: Library Catalog


Bibliographic Details
Main Authors:	Xing, Mingzhe; Zhang, Rongkai; Xue, Hui; Chen, Qi; Yang, Fan; Xiao, Zhen
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Human-Computer Interaction; Software Engineering
Online Access:	https://arxiv.org/abs/2402.06596

_version_	1866909100463357952
author	Xing, Mingzhe; Zhang, Rongkai; Xue, Hui; Chen, Qi; Yang, Fan; Xiao, Zhen
author_facet	Xing, Mingzhe; Zhang, Rongkai; Xue, Hui; Chen, Qi; Yang, Fan; Xiao, Zhen
contents	Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. However, when applied to general-purpose software systems like operating systems, LLM agents face three primary challenges. Firstly, the action space is vast and dynamic, making it difficult for LLM agents to maintain an up-to-date understanding and deliver accurate responses. Secondly, real-world tasks often require inter-application cooperation, demanding farsighted planning from LLM agents. Thirdly, agents need to identify optimal solutions that align with user constraints, such as security concerns and preferences. These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system. To address the high cost of manpower, we design a scalable and semi-automated method to construct the benchmark. In the task evaluation, AndroidArena incorporates accurate and adaptive metrics to address the issue of non-unique solutions. Our findings reveal that even state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to specific constraints. Additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as primary reasons for the failure of LLM agents. Furthermore, we provide empirical analysis on the failure of reflection, and improve the success rate by 27% with our proposed exploration strategy. This work is the first to present valuable insights into the fine-grained weaknesses of LLM agents, and offers a path forward for future research in this area. Environment, benchmark, and evaluation code for AndroidArena are released at https://github.com/AndroidArenaAgent/AndroidArena.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_06596
institution	arXiv
publishDate	2024
record_format	arxiv
title	Understanding the Weakness of Large Language Model Agents within a Complex Android Environment
topic	Artificial Intelligence; Human-Computer Interaction; Software Engineering
url	https://arxiv.org/abs/2402.06596

Similar Items