| Main Authors: | Gu, Li, Jiang, Zihuan, Chi, Zhixiang, Liu, Huan, Wang, Ziqiang, Yu, Yuanhao, Berseth, Glen, Wang, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.07432 |
Similar Items
AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent
by: Fan, Jingru, et al.
Published: (2025)
Computer-Use Agents as Judges for Generative User Interface
by: Lin, Kevin Qinghong, et al.
Published: (2025)
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
by: Wang, Haoming, et al.
Published: (2025)
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
by: Hsiao, Yu-Chung, et al.
Published: (2022)
OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
by: Cheng, Kanzhi, et al.
Published: (2026)
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
by: You, Keen, et al.
Published: (2024)
History-Aware Reasoning for GUI Agents
by: Wang, Ziwei, et al.
Published: (2025)
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
by: Wu, Zhiyong, et al.
Published: (2024)
VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
by: Mazumdar, Amrita, et al.
Published: (2026)
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
by: Luo, Run, et al.
Published: (2025)
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
by: Sun, Qiushi, et al.
Published: (2025)
GUICourse: From General Vision Language Models to Versatile GUI Agents
by: Chen, Wentong, et al.
Published: (2024)
Morae: Proactively Pausing UI Agents for User Choices
by: Peng, Yi-Hao, et al.
Published: (2025)
Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding
by: Song, Yaoxian, et al.
Published: (2023)
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
by: Qin, Yujia, et al.
Published: (2025)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)
SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
by: Wang, Siting, et al.
Published: (2025)
Long-Term Ad Memorability: Understanding & Generating Memorable Ads
by: SI, Harini, et al.
Published: (2023)
Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception
by: da Silva, Neemias, et al.
Published: (2026)
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
by: Sun, Qiushi, et al.
Published: (2024)
UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments
by: Xiang, Jiannan, et al.
Published: (2025)
GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting
by: Yang, Kaichun, et al.
Published: (2025)
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
by: Sun, Qiushi, et al.
Published: (2025)
True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies
by: Blasilli, Graziano, et al.
Published: (2026)
Tur[k]ingBench: A Challenge Benchmark for Web Agents
by: Xu, Kevin, et al.
Published: (2024)
Learning Multimodal Cues of Children's Uncertainty
by: Cheng, Qi, et al.
Published: (2024)
What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric
by: Kerkouri, Mohamed Amine, et al.
Published: (2026)
SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
by: Jing, Hongyi, et al.
Published: (2025)
EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution
by: Wang, Tianfu, et al.
Published: (2026)
DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation
by: Li, Jia, et al.
Published: (2024)
Investigating Disability Representations in Text-to-Image Models
by: Tian, Yang, et al.
Published: (2026)
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
by: Xie, Tianbao, et al.
Published: (2025)
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
by: Huang, Xiyang, et al.
Published: (2026)
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
by: Zhang, Jiwen, et al.
Published: (2024)
A Survey on (M)LLM-Based GUI Agents
by: Tang, Fei, et al.
Published: (2025)
Code2World: A GUI World Model via Renderable Code Generation
by: Zheng, Yuhao, et al.
Published: (2026)
Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis
by: Brock, James, et al.
Published: (2026)
OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?
by: Chen, Xuetian, et al.
Published: (2025)
Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
by: Wu, Zongru, et al.
Published: (2025)