Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Lei, Bin; Kang, Weitai; Zhang, Zijian; Chen, Winson; Xie, Xi; Zuo, Shan; Xie, Mimi; Payani, Ali; Hong, Mingyi; Yan, Yan; Ding, Caiwen
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.10887

_version_	1866910181759123456
author	Lei, Bin; Kang, Weitai; Zhang, Zijian; Chen, Winson; Xie, Xi; Zuo, Shan; Xie, Mimi; Payani, Ali; Hong, Mingyi; Yan, Yan; Ding, Caiwen
author_facet	Lei, Bin; Kang, Weitai; Zhang, Zijian; Chen, Winson; Xie, Xi; Zuo, Shan; Xie, Mimi; Payani, Ali; Hong, Mingyi; Yan, Yan; Ding, Caiwen
contents	This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_10887
institution	arXiv
publishDate	2025
record_format	arxiv
title	InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
topic	Artificial Intelligence
url	https://arxiv.org/abs/2505.10887
