| Main Authors: | Lim, Jaehyuk, Lee, Bruce W. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2408.09111 |
Similar Items
InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
by: Zhao, Henry Hengyuan, et al.
Published: (2025)
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
by: Sun, Qiushi, et al.
Published: (2025)
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)
AIN: The Arabic INclusive Large Multimodal Model
by: Heakl, Ahmed, et al.
Published: (2025)
E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model
by: Lin, Ronghao, et al.
Published: (2025)
Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition
by: Tsangko, Iosif, et al.
Published: (2025)
Can Large Language Models Capture Video Game Engagement?
by: Melhart, David, et al.
Published: (2025)
GUICourse: From General Vision Language Models to Versatile GUI Agents
by: Chen, Wentong, et al.
Published: (2024)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
by: Chen, Baiyu, et al.
Published: (2026)
Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning
by: Huang, Ting-Hao 'Kenneth', et al.
Published: (2025)
Code2World: A GUI World Model via Renderable Code Generation
by: Zheng, Yuhao, et al.
Published: (2026)
Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models
by: Lopez-Cardona, Angela, et al.
Published: (2024)
A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models
by: Tsangko, Iosif, et al.
Published: (2026)
Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations
by: Pal, Ankit, et al.
Published: (2024)
EEG-based Multimodal Representation Learning for Emotion Recognition
by: Yin, Kang, et al.
Published: (2024)
GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding
by: Zhou, Shijie, et al.
Published: (2025)
Tur[k]ingBench: A Challenge Benchmark for Web Agents
by: Xu, Kevin, et al.
Published: (2024)
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
by: Sun, Qiushi, et al.
Published: (2024)
SwissADT: An Audio Description Translation System for Swiss Languages
by: Fischer, Lukas, et al.
Published: (2024)
ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles
by: Yin, Kayo, et al.
Published: (2024)
How Good (Or Bad) Are LLMs at Detecting Misleading Visualizations?
by: Lo, Leo Yu-Ho, et al.
Published: (2024)
Proactive Assistant Dialogue Generation from Streaming Egocentric Videos
by: Zhang, Yichi, et al.
Published: (2025)
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
by: Sun, Qiushi, et al.
Published: (2025)
Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization
by: Wang, Yibo, et al.
Published: (2026)
SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
by: Jing, Hongyi, et al.
Published: (2025)
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
by: Wang, Haoming, et al.
Published: (2025)
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
by: Xie, Tianbao, et al.
Published: (2025)
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
by: Ouyang, Mingyu, et al.
Published: (2026)
DesignPref: Capturing Personal Preferences in Visual Design Generation
by: Peng, Yi-Hao, et al.
Published: (2025)
AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent
by: Fan, Jingru, et al.
Published: (2025)
SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation
by: Khalil, Mohammad Amer, et al.
Published: (2026)
OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
by: Cheng, Kanzhi, et al.
Published: (2026)
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
by: Qin, Yujia, et al.
Published: (2025)
History-Aware Reasoning for GUI Agents
by: Wang, Ziwei, et al.
Published: (2025)
OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?
by: Chen, Xuetian, et al.
Published: (2025)
Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis
by: Brock, James, et al.
Published: (2026)
A Survey on (M)LLM-Based GUI Agents
by: Tang, Fei, et al.
Published: (2025)
Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
by: Wu, Zongru, et al.
Published: (2025)
MAPWise: Evaluating Vision-Language Models for Advanced Map Queries
by: Mukhopadhyay, Srija, et al.
Published: (2024)