Staff View: EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Xuetian; Li, Hangcheng; Liang, Jiaqing; Jiang, Sihang; Yang, Deqing
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2410.19461

_version_	1866915002457260032
author	Chen, Xuetian; Li, Hangcheng; Liang, Jiaqing; Jiang, Sihang; Yang, Deqing
author_facet	Chen, Xuetian; Li, Hangcheng; Liang, Jiaqing; Jiang, Sihang; Yang, Deqing
contents	Autonomous agents operating on the graphical user interfaces (GUIs) of various applications hold immense practical value. Unlike the large language model (LLM)-based methods which rely on structured texts and customized backends, the approaches using large vision-language models (LVLMs) are more intuitive and adaptable as they can visually perceive and directly interact with screens, making them indispensable in general scenarios without text metadata and tailored backends. Given the lack of high-quality training data for GUI-related tasks in existing work, this paper aims to enhance the GUI understanding and interacting capabilities of LVLMs through a data-driven approach. We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web. Evaluation results on various GUI and agent benchmarks demonstrate that the model trained with the dataset generated through EDGE exhibits superior webpage understanding capabilities, which can then be easily transferred to previously unseen desktop and mobile environments. Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work. Our source code, the dataset and the model are available at https://anonymous.4open.science/r/EDGE-1CDB.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_19461
institution	arXiv
publishDate	2024
record_format	arxiv
title	EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
topic	Artificial Intelligence
url	https://arxiv.org/abs/2410.19461
