Staff View: SQLAgent: Learning to Explore Before Generating as a Data Engineer

Bibliographic Details
Main Authors:	Jiang, Wenjia; Wang, Yiwei; Han, Boyan; Zhou, Joey Tianyi; Zhang, Chi
Format:	Preprint
Published:	2026
Subjects:	Databases
Online Access:	https://arxiv.org/abs/2602.01952

_version_	1866910008710529024
author	Jiang, Wenjia; Wang, Yiwei; Han, Boyan; Zhou, Joey Tianyi; Zhang, Chi
author_facet	Jiang, Wenjia; Wang, Yiwei; Han, Boyan; Zhou, Joey Tianyi; Zhang, Chi
contents	Large Language Models have recently shown impressive capabilities in reasoning and code generation, making them promising tools for natural language interfaces to relational databases. However, existing approaches often fail to generalize in complex, real-world settings due to the highly database-specific nature of SQL reasoning, which requires deep familiarity with unique schemas, ambiguous semantics, and intricate join paths. To address this challenge, we introduce a novel two-stage LLM-based framework that decouples knowledge acquisition from query generation. In the Exploration Stage, the system autonomously constructs a database-specific knowledge base by navigating the schema with a Monte Carlo Tree Search-inspired strategy, generating triplets of schema fragments, executable queries, and natural language descriptions as usage examples. In the Deployment Stage, a dual-agent system leverages the collected knowledge as in-context examples to iteratively retrieve relevant information and generate accurate SQL queries in response to user questions. This design enables the agent to proactively familiarize itself with unseen databases and handle complex, multi-step reasoning. Extensive experiments on large-scale benchmarks demonstrate that our approach significantly improves accuracy over strong baselines, highlighting its effectiveness and generalizability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_01952
institution	arXiv
publishDate	2026
record_format	arxiv
title	SQLAgent: Learning to Explore Before Generating as a Data Engineer
topic	Databases
url	https://arxiv.org/abs/2602.01952
