:: Library Catalog

Bibliographic Details
Main Authors:	Gu, Li; Jiang, Zihuan; Chi, Zhixiang; Liu, Huan; Wang, Ziqiang; Yu, Yuanhao; Berseth, Glen; Wang, Yang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Human-Computer Interaction; Machine Learning
Online Access:	https://arxiv.org/abs/2603.07432

Similar Items

AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent
by: Fan, Jingru, et al.
Published: (2025)

Computer-Use Agents as Judges for Generative User Interface
by: Lin, Kevin Qinghong, et al.
Published: (2025)

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
by: Wang, Haoming, et al.
Published: (2025)

ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
by: Hsiao, Yu-Chung, et al.
Published: (2022)

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
by: Cheng, Kanzhi, et al.
Published: (2026)

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
by: You, Keen, et al.
Published: (2024)

History-Aware Reasoning for GUI Agents
by: Wang, Ziwei, et al.
Published: (2025)

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
by: Wu, Zhiyong, et al.
Published: (2024)

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
by: Mazumdar, Amrita, et al.
Published: (2026)

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
by: Luo, Run, et al.
Published: (2025)

OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
by: Sun, Qiushi, et al.
Published: (2025)

GUICourse: From General Vision Language Models to Versatile GUI Agents
by: Chen, Wentong, et al.
Published: (2024)

Morae: Proactively Pausing UI Agents for User Choices
by: Peng, Yi-Hao, et al.
Published: (2025)

Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding
by: Song, Yaoxian, et al.
Published: (2023)

UI-TARS: Pioneering Automated GUI Interaction with Native Agents
by: Qin, Yujia, et al.
Published: (2025)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)

SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
by: Wang, Siting, et al.
Published: (2025)

Long-Term Ad Memorability: Understanding & Generating Memorable Ads
by: SI, Harini, et al.
Published: (2023)

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception
by: da Silva, Neemias, et al.
Published: (2026)

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
by: Sun, Qiushi, et al.
Published: (2024)

UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments
by: Xiang, Jiannan, et al.
Published: (2025)

GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting
by: Yang, Kaichun, et al.
Published: (2025)

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
by: Sun, Qiushi, et al.
Published: (2025)

True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies
by: Blasilli, Graziano, et al.
Published: (2026)

Tur[k]ingBench: A Challenge Benchmark for Web Agents
by: Xu, Kevin, et al.
Published: (2024)

Learning Multimodal Cues of Children's Uncertainty
by: Cheng, Qi, et al.
Published: (2024)

What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric
by: Kerkouri, Mohamed Amine, et al.
Published: (2026)

SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
by: Jing, Hongyi, et al.
Published: (2025)

EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution
by: Wang, Tianfu, et al.
Published: (2026)

DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation
by: Li, Jia, et al.
Published: (2024)

Investigating Disability Representations in Text-to-Image Models
by: Tian, Yang, et al.
Published: (2026)

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
by: Xie, Tianbao, et al.
Published: (2025)

SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
by: Huang, Xiyang, et al.
Published: (2026)

Android in the Zoo: Chain-of-Action-Thought for GUI Agents
by: Zhang, Jiwen, et al.
Published: (2024)

A Survey on (M)LLM-Based GUI Agents
by: Tang, Fei, et al.
Published: (2025)

Code2World: A GUI World Model via Renderable Code Generation
by: Zheng, Yuhao, et al.
Published: (2026)

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis
by: Brock, James, et al.
Published: (2026)

OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?
by: Chen, Xuetian, et al.
Published: (2025)

Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
by: Wu, Zongru, et al.
Published: (2025)