Bibliographic Details
Main Authors: Wei, Yanbin, Fu, Shuai, Jiang, Weisen, Zhang, Zejian, Zeng, Zhixiong, Wu, Qi, Kwok, James T., Zhang, Yu
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2402.02130
_version_ 1866929569886371840
author Wei, Yanbin
Fu, Shuai
Jiang, Weisen
Zhang, Zejian
Zeng, Zhixiong
Wu, Qi
Kwok, James T.
Zhang, Yu
author_facet Wei, Yanbin
Fu, Shuai
Jiang, Weisen
Zhang, Zejian
Zeng, Zhixiong
Wu, Qi
Kwok, James T.
Zhang, Yu
contents Large Language Models (LLMs) are increasingly used for various tasks with graph structures. Though LLMs can process graph information in a textual format, they overlook the rich vision modality, which is an intuitive way for humans to comprehend structural information and conduct general graph reasoning. The potential benefits and capabilities of representing graph structures as visual images (i.e., $\textit{visual graph}$) are still unexplored. To fill the gap, we propose an end-to-end framework, called $\textbf{G}$raph to v$\textbf{I}$sual and $\textbf{T}$extual Integr$\textbf{A}$tion (GITA), which is the first to incorporate visual graphs into general graph reasoning. In addition, we establish the $\textbf{G}$raph-based $\textbf{V}$ision-$\textbf{L}$anguage $\textbf{Q}$uestion $\textbf{A}$nswering (GVLQA) dataset from existing graph data, which is the first vision-language dataset for general graph reasoning purposes. Extensive experiments on the GVLQA dataset and five real-world datasets show that GITA outperforms mainstream LLMs in terms of general graph reasoning capabilities. Moreover, we highlight the effectiveness of layout augmentation on visual graphs and of pretraining on the GVLQA dataset.
format Preprint
id arxiv_https___arxiv_org_abs_2402_02130
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
Wei, Yanbin
Fu, Shuai
Jiang, Weisen
Zhang, Zejian
Zeng, Zhixiong
Wu, Qi
Kwok, James T.
Zhang, Yu
Computation and Language
title GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
topic Computation and Language
url https://arxiv.org/abs/2402.02130