:: Library Catalog

Bibliographic Details
Main Authors:	Shen, Huawen; Liu, Chang; Li, Gengluo; Wang, Xinlong; Zhou, Yu; Ma, Can; Ji, Xiangyang
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2412.09362

Similar Items

LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining
by: Shen, Huawen, et al.
Published: (2024)

Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts
by: Li, Gengluo, et al.
Published: (2025)

Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition
by: Wu, Daiqing, et al.
Published: (2024)

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
by: Zhang, Yan, et al.
Published: (2026)

UI-TARS: Pioneering Automated GUI Interaction with Native Agents
by: Qin, Yujia, et al.
Published: (2025)

UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
by: Liu, Xinyi, et al.
Published: (2025)

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
by: Lian, Shuquan, et al.
Published: (2025)

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
by: Tang, Fei, et al.
Published: (2026)

Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants
by: Yang, Jiuding, et al.
Published: (2024)

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
by: Wang, Haoming, et al.
Published: (2025)

SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
by: Jing, Hongyi, et al.
Published: (2025)

Aria-UI: Visual Grounding for GUI Instructions
by: Yang, Yuhao, et al.
Published: (2024)

Infer Human's Intentions Before Following Natural Language Instructions
by: Wan, Yanming, et al.
Published: (2024)

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
by: Li, Zhangheng, et al.
Published: (2024)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)

UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
by: Xiao, Han, et al.
Published: (2025)

Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
by: Li, Gengluo, et al.
Published: (2026)

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
by: Chen, Dongping, et al.
Published: (2024)

MAIC-UI: Making Interactive Courseware with Generative UI
by: Tu, Shangqing, et al.
Published: (2026)

MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
by: Zang, Yuan, et al.
Published: (2025)

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
by: You, Keen, et al.
Published: (2024)

One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
by: Jia, Qi, et al.
Published: (2025)

MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation
by: Liu, Yile, et al.
Published: (2025)

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
by: Nayak, Shravan, et al.
Published: (2025)

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
by: Yang, Zhen, et al.
Published: (2025)

UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
by: Han, Wenkang, et al.
Published: (2025)

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
by: Wu, Xuansheng, et al.
Published: (2023)

GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
by: Zhou, Yuqi, et al.
Published: (2025)

Morae: Proactively Pausing UI Agents for User Choices
by: Peng, Yi-Hao, et al.
Published: (2025)

UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
by: Chen, Liangyu, et al.
Published: (2025)

Identifying User Goals from UI Trajectories
by: Berkovitch, Omri, et al.
Published: (2024)

UltraIF: Advancing Instruction Following from the Wild
by: An, Kaikai, et al.
Published: (2025)

Incubating Text Classifiers Following User Instruction with Nothing but LLM
by: Peng, Letian, et al.
Published: (2024)

Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding
by: Jeon, Jaehyun, et al.
Published: (2025)

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
by: Yang, Wenkui, et al.
Published: (2026)

Instruction Following without Instruction Tuning
by: Hewitt, John, et al.
Published: (2024)

Instruction-Following Pruning for Large Language Models
by: Hou, Bairu, et al.
Published: (2025)

MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
by: Wu, Qinzhuo, et al.
Published: (2024)

MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation
by: Li, Gengluo, et al.
Published: (2026)

StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
by: Li, Jinnan, et al.
Published: (2025)