Bibliographic Details
Main Authors: Lei, Bin; Kang, Weitai; Zhang, Zijian; Chen, Winson; Xie, Xi; Zuo, Shan; Xie, Mimi; Payani, Ali; Hong, Mingyi; Yan, Yan; Ding, Caiwen
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2505.10887
author Lei, Bin
Kang, Weitai
Zhang, Zijian
Chen, Winson
Xie, Xi
Zuo, Shan
Xie, Mimi
Payani, Ali
Hong, Mingyi
Yan, Yan
Ding, Caiwen
contents This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. We demonstrate this generality by evaluating the agent not only on pure vision-based real-world benchmarks (i.e., OSWorld) but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
format Preprint
id arxiv_https___arxiv_org_abs_2505_10887
institution arXiv
publishDate 2025
record_format arxiv
title InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
topic Artificial Intelligence
url https://arxiv.org/abs/2505.10887