| Main Authors: | Shen, Huawen, Liu, Chang, Li, Gengluo, Wang, Xinlong, Zhou, Yu, Ma, Can, Ji, Xiangyang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.09362 |
Similar Items
LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining
by: Shen, Huawen, et al.
Published: (2024)
Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts
by: Li, Gengluo, et al.
Published: (2025)
Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition
by: Wu, Daiqing, et al.
Published: (2024)
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
by: Zhang, Yan, et al.
Published: (2026)
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
by: Qin, Yujia, et al.
Published: (2025)
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
by: Liu, Xinyi, et al.
Published: (2025)
UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
by: Lian, Shuquan, et al.
Published: (2025)
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
by: Tang, Fei, et al.
Published: (2026)
Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants
by: Yang, Jiuding, et al.
Published: (2024)
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
by: Wang, Haoming, et al.
Published: (2025)
SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
by: Jing, Hongyi, et al.
Published: (2025)
Aria-UI: Visual Grounding for GUI Instructions
by: Yang, Yuhao, et al.
Published: (2024)
Infer Human's Intentions Before Following Natural Language Instructions
by: Wan, Yanming, et al.
Published: (2024)
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
by: Li, Zhangheng, et al.
Published: (2024)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)
UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
by: Xiao, Han, et al.
Published: (2025)
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
by: Li, Gengluo, et al.
Published: (2026)
GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
by: Chen, Dongping, et al.
Published: (2024)
MAIC-UI: Making Interactive Courseware with Generative UI
by: Tu, Shangqing, et al.
Published: (2026)
MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
by: Zang, Yuan, et al.
Published: (2025)
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
by: You, Keen, et al.
Published: (2024)
One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
by: Jia, Qi, et al.
Published: (2025)
MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation
by: Liu, Yile, et al.
Published: (2025)
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
by: Nayak, Shravan, et al.
Published: (2025)
Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
by: Yang, Zhen, et al.
Published: (2025)
UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
by: Han, Wenkang, et al.
Published: (2025)
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
by: Wu, Xuansheng, et al.
Published: (2023)
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
by: Zhou, Yuqi, et al.
Published: (2025)
Morae: Proactively Pausing UI Agents for User Choices
by: Peng, Yi-Hao, et al.
Published: (2025)
UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
by: Chen, Liangyu, et al.
Published: (2025)
Identifying User Goals from UI Trajectories
by: Berkovitch, Omri, et al.
Published: (2024)
UltraIF: Advancing Instruction Following from the Wild
by: An, Kaikai, et al.
Published: (2025)
Incubating Text Classifiers Following User Instruction with Nothing but LLM
by: Peng, Letian, et al.
Published: (2024)
Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding
by: Jeon, Jaehyun, et al.
Published: (2025)
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
by: Yang, Wenkui, et al.
Published: (2026)
Instruction Following without Instruction Tuning
by: Hewitt, John, et al.
Published: (2024)
Instruction-Following Pruning for Large Language Models
by: Hou, Bairu, et al.
Published: (2025)
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
by: Wu, Qinzhuo, et al.
Published: (2024)
MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation
by: Li, Gengluo, et al.
Published: (2026)
StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
by: Li, Jinnan, et al.
Published: (2025)