Bibliographic Details
Main Authors: Nguyen, Dai Quoc; Hoang, Cong Duy Vu; Vu, Duy; Tangari, Gioacchino; Vu, Thanh Tien; Dharmasiri, Don; Li, Yuan-Fang; Duong, Long
Format: Preprint
Published: 2025
Subjects: Computation and Language; Artificial Intelligence; Machine Learning; Software Engineering
Online Access: https://arxiv.org/abs/2502.16747
_version_ 1866909617324294144
author Nguyen, Dai Quoc
Hoang, Cong Duy Vu
Vu, Duy
Tangari, Gioacchino
Vu, Thanh Tien
Dharmasiri, Don
Li, Yuan-Fang
Duong, Long
contents Open-weight large language models (LLMs) have significantly advanced performance on the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes on large database schemas, where the context length grows. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios for the NL2SQL task. SQLong generates augmented datasets by extending existing database schemas with additional synthetic CREATE TABLE commands and corresponding data rows, sampled from diverse schemas in the training data. This approach simulates long-context scenarios during both finetuning and evaluation. Through experiments on the Spider and BIRD datasets, we demonstrate that LLMs finetuned with SQLong-augmented data significantly outperform those trained on standard datasets. These results indicate SQLong's practical applicability and its impact on improving NL2SQL capabilities in real-world settings with complex database schemas.
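The augmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name augment_schema, the (table name, DDL text) pair representation, the name-clash filter, and the shuffling of the combined tables are all assumptions made here for illustration. Sample data rows, which the abstract also mentions, are assumed to travel inside the DDL text if present.

```python
import random


def augment_schema(target_tables, pool_tables, n_extra, seed=0):
    """Lengthen a schema with distractor CREATE TABLE statements.

    target_tables / pool_tables: lists of (table_name, ddl_text) pairs.
    All names and the data layout here are hypothetical.
    """
    rng = random.Random(seed)
    target_names = {name for name, _ in target_tables}
    # Assumption: skip pool tables whose names clash with the target
    # schema, and keep one DDL per distinct table name.
    candidates = {
        name: ddl for name, ddl in pool_tables if name not in target_names
    }
    extras = rng.sample(list(candidates.items()), min(n_extra, len(candidates)))
    combined = list(target_tables) + extras
    # Assumption: shuffle so the original tables are not always first.
    rng.shuffle(combined)
    return "\n\n".join(ddl for _, ddl in combined)


target = [("singer", "CREATE TABLE singer (id INT PRIMARY KEY, name TEXT);")]
pool = [
    ("airport", "CREATE TABLE airport (code TEXT PRIMARY KEY, city TEXT);"),
    ("singer", "CREATE TABLE singer (sid INT);"),  # name clash, skipped
    ("orders", "CREATE TABLE orders (id INT PRIMARY KEY, total REAL);"),
]
prompt_schema = augment_schema(target, pool, n_extra=2)
```

Raising n_extra lengthens the schema text placed in the model's context, which is how a long-context example would be simulated under these assumptions.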
format Preprint
id arxiv_https___arxiv_org_abs_2502_16747
institution arXiv
publishDate 2025
record_format arxiv
title SQLong: Enhanced NL2SQL for Longer Contexts with LLMs
topic Computation and Language
Artificial Intelligence
Machine Learning
Software Engineering
url https://arxiv.org/abs/2502.16747