Library Catalog

Bibliographic Details
Main Authors:	Lim, Jaehyuk, Lee, Bruce W.
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Human-Computer Interaction
Online Access:	https://arxiv.org/abs/2408.09111

Similar Items

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
by: Zhao, Henry Hengyuan, et al.
Published: (2025)

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
by: Sun, Qiushi, et al.
Published: (2025)

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)

AIN: The Arabic INclusive Large Multimodal Model
by: Heakl, Ahmed, et al.
Published: (2025)

E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model
by: Lin, Ronghao, et al.
Published: (2025)

Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition
by: Tsangko, Iosif, et al.
Published: (2025)

Can Large Language Models Capture Video Game Engagement?
by: Melhart, David, et al.
Published: (2025)

GUICourse: From General Vision Language Models to Versatile GUI Agents
by: Chen, Wentong, et al.
Published: (2024)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
by: Chen, Baiyu, et al.
Published: (2026)

Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning
by: Huang, Ting-Hao 'Kenneth', et al.
Published: (2025)

Code2World: A GUI World Model via Renderable Code Generation
by: Zheng, Yuhao, et al.
Published: (2026)

Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models
by: Lopez-Cardona, Angela, et al.
Published: (2024)

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models
by: Tsangko, Iosif, et al.
Published: (2026)

Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations
by: Pal, Ankit, et al.
Published: (2024)

EEG-based Multimodal Representation Learning for Emotion Recognition
by: Yin, Kang, et al.
Published: (2024)

GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding
by: Zhou, Shijie, et al.
Published: (2025)

Tur[k]ingBench: A Challenge Benchmark for Web Agents
by: Xu, Kevin, et al.
Published: (2024)

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
by: Sun, Qiushi, et al.
Published: (2024)

SwissADT: An Audio Description Translation System for Swiss Languages
by: Fischer, Lukas, et al.
Published: (2024)

ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles
by: Yin, Kayo, et al.
Published: (2024)

How Good (Or Bad) Are LLMs at Detecting Misleading Visualizations?
by: Lo, Leo Yu-Ho, et al.
Published: (2024)

Proactive Assistant Dialogue Generation from Streaming Egocentric Videos
by: Zhang, Yichi, et al.
Published: (2025)

OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
by: Sun, Qiushi, et al.
Published: (2025)

Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization
by: Wang, Yibo, et al.
Published: (2026)

SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
by: Jing, Hongyi, et al.
Published: (2025)

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
by: Wang, Haoming, et al.
Published: (2025)

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
by: Xie, Tianbao, et al.
Published: (2025)

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
by: Ouyang, Mingyu, et al.
Published: (2026)

DesignPref: Capturing Personal Preferences in Visual Design Generation
by: Peng, Yi-Hao, et al.
Published: (2025)

AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent
by: Fan, Jingru, et al.
Published: (2025)

SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation
by: Khalil, Mohammad Amer, et al.
Published: (2026)

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
by: Cheng, Kanzhi, et al.
Published: (2026)

UI-TARS: Pioneering Automated GUI Interaction with Native Agents
by: Qin, Yujia, et al.
Published: (2025)

History-Aware Reasoning for GUI Agents
by: Wang, Ziwei, et al.
Published: (2025)

OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?
by: Chen, Xuetian, et al.
Published: (2025)

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis
by: Brock, James, et al.
Published: (2026)

A Survey on (M)LLM-Based GUI Agents
by: Tang, Fei, et al.
Published: (2025)

Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
by: Wu, Zongru, et al.
Published: (2025)

MAPWise: Evaluating Vision-Language Models for Advanced Map Queries
by: Mukhopadhyay, Srija, et al.
Published: (2024)