Bibliographic Details
Main Authors: Lyu, Songlin; Ban, Lujie; Wu, Zihang; Luo, Tianqi; Liu, Jirong; Ma, Chenhao; Luo, Yuyu; Tang, Nan; Qi, Shipeng; Lin, Heng; Liu, Yongchao; Hong, Chuntao
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.11745
_version_ 1866908829501882368
author Lyu, Songlin
Ban, Lujie
Wu, Zihang
Luo, Tianqi
Liu, Jirong
Ma, Chenhao
Luo, Yuyu
Tang, Nan
Qi, Shipeng
Lin, Heng
Liu, Yongchao
Hong, Chuntao
contents Graph models are fundamental to data analysis in domains rich with complex relationships. Text-to-Graph-Query-Language (Text-to-GQL) systems act as translators, converting natural language into executable graph queries. This capability allows Large Language Models (LLMs) to directly analyze and manipulate graph data, positioning them as powerful agent infrastructures for Graph Database Management Systems (GDBMS). Despite recent progress, existing datasets are often limited in domain coverage, supported graph query languages, or evaluation scope. The advancement of Text-to-GQL systems is hindered by the lack of high-quality benchmark datasets and evaluation methods for systematically comparing model capabilities across different graph query languages and domains. In this work, we present Text2GQL-Bench, a unified Text-to-GQL benchmark designed to address these limitations. Text2GQL-Bench couples a multi-GQL dataset of 178,184 (Question, Query) pairs spanning 13 domains with a scalable construction framework that generates datasets across different domains, question abstraction levels, and GQLs from heterogeneous resources. To support comprehensive assessment, we introduce an evaluation method that goes beyond a single end-to-end metric by jointly reporting grammatical validity, similarity, semantic alignment, and execution accuracy. Our evaluation uncovers a stark dialect gap in ISO-GQL generation: even strong LLMs achieve at most 4% execution accuracy (EX) in zero-shot settings. Although a fixed 3-shot prompt raises accuracy to around 50%, grammatical validity remains below 70%. Moreover, a fine-tuned 8B open-weight model reaches 45.1% EX and 90.8% grammatical validity, demonstrating that most of the performance jump is unlocked by exposure to sufficient ISO-GQL examples.
format Preprint
id arxiv_https___arxiv_org_abs_2602_11745
institution arXiv
publishDate 2026
record_format arxiv
title Text2GQL-Bench: A Text to Graph Query Language Benchmark [Experiment, Analysis & Benchmark]
topic Artificial Intelligence
url https://arxiv.org/abs/2602.11745