Library Catalog

Bibliographic Details
Main Authors:	Agashe, Saaket; Wong, Kyle; Tu, Vincent; Yang, Jiachen; Li, Ang; Wang, Xin Eric
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2504.00906

Similar Items

Agent S: An Open Agentic Framework that Uses Computers Like a Human
by: Agashe, Saaket, et al.
Published: (2024)

Scaling Agents for Computer Use
by: Gonzalez-Pumariega, Gonzalo, et al.
Published: (2025)

On the Reliability of Computer Use Agents
by: Gonzalez-Pumariega, Gonzalo, et al.
Published: (2026)

An Embodied Generalist Agent in 3D World
by: Huang, Jiangyong, et al.
Published: (2023)

OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
by: Yang, Bowen, et al.
Published: (2026)

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
by: Hu, Xueyu, et al.
Published: (2025)

GPT-4V(ision) is a Generalist Web Agent, if Grounded
by: Zheng, Boyuan, et al.
Published: (2024)

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
by: Hu, Siyuan, et al.
Published: (2024)

JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents
by: Zheng, Kaizhi, et al.
Published: (2022)

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
by: Ashraf, Tajamul, et al.
Published: (2025)

Self-Resource Allocation in Multi-Agent LLM Systems
by: Amayuelas, Alfonso, et al.
Published: (2025)

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
by: Li, Yan, et al.
Published: (2026)

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
by: Feng, Weixi, et al.
Published: (2024)

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
by: Hu, Wenbo, et al.
Published: (2026)

ComCLIP: Training-Free Compositional Image and Text Matching
by: Jiang, Kenan, et al.
Published: (2022)

CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering
by: Saha, Aranya, et al.
Published: (2025)

InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
by: InternAgent Team, et al.
Published: (2025)

Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
by: Zhang, Fan, et al.
Published: (2024)

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
by: Liu, Xiao, et al.
Published: (2024)

From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
by: Ji, Haonian, et al.
Published: (2025)

METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
by: Li, Bingxuan, et al.
Published: (2025)

Navigation as Attackers Wish? Towards Building Robust Embodied Agents under Federated Learning
by: Zhang, Yunchao, et al.
Published: (2022)

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
by: LASA Team, et al.
Published: (2025)

DPO Learning with LLMs-Judge Signal for Computer Use Agents
by: Luo, Man, et al.
Published: (2025)

PresentAgent-2: Towards Generalist Multimodal Presentation Agents
by: Wu, Wei, et al.
Published: (2026)

RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models
by: Chen, Haoyu, et al.
Published: (2024)

Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use
by: Xi, Jiajun, et al.
Published: (2024)

Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
by: Jin, Zhuoran, et al.
Published: (2025)

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
by: Buettner, Kyle, et al.
Published: (2025)

Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning
by: Gu, Zishan, et al.
Published: (2024)

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
by: Wu, Qianhui, et al.
Published: (2025)

A Multimodal Automated Interpretability Agent
by: Shaham, Tamar Rott, et al.
Published: (2024)

Large Multimodal Agents: A Survey
by: Xie, Junlin, et al.
Published: (2024)

OpenCUA: Open Foundations for Computer-Use Agents
by: Wang, Xinyuan, et al.
Published: (2025)

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security
by: Liu, Dongrui, et al.
Published: (2026)

Fara-7B: An Efficient Agentic Model for Computer Use
by: Awadallah, Ahmed, et al.
Published: (2025)

From Specialist to Generalist: Unlocking SAM's Learning Potential on Unlabeled Medical Images
by: Vu, Vi, et al.
Published: (2026)

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
by: Wei, Cong, et al.
Published: (2024)

MCU: An Evaluation Framework for Open-Ended Game Agents
by: Zheng, Xinyue, et al.
Published: (2023)