April 10, 2026 (Open Access)

SchemaRAG: A Schema-aware Retrieval-Augmented Generation Framework for Text-to-SQL

Key Points

The aim is to enhance the ability of models to generate valid SQL queries from natural language by improving schema understanding.
Developed SchemaLinker for aligning natural language with database schema items through knowledge distillation.
Created a schema-augmented retriever to fetch relevant examples based on database schema.
Implemented a Pareto-optimal selection mechanism to choose the best SQL query from candidate queries.
SchemaRAG significantly outperformed four leading Text-to-SQL models in generating valid SQL queries.
Extensive experiments demonstrated improved robustness and accuracy across five benchmark datasets.
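
To make the idea of a schema-augmented retriever concrete, here is a minimal sketch. It is not the authors' implementation: the `Example` record, the Jaccard similarity, and the mixing weight `alpha` are all assumptions chosen for illustration. It ranks stored examples by blending question-word overlap with overlap between the schema items linked to the new question and those used by each example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """A stored demonstration (hypothetical record layout)."""
    question: str
    schema_items: frozenset  # e.g. {"singer.name", "concert.year"}
    sql: str


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(question, linked_items, pool, k=2, alpha=0.5):
    """Rank pool examples by a blend of question and schema similarity.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    q_tokens = set(question.lower().split())
    linked = frozenset(linked_items)

    def score(ex):
        text_sim = jaccard(q_tokens, set(ex.question.lower().split()))
        schema_sim = jaccard(linked, ex.schema_items)
        return alpha * text_sim + (1 - alpha) * schema_sim

    # Highest blended score first; keep the top k examples.
    return sorted(pool, key=score, reverse=True)[:k]
```

A real system would more likely use learned embeddings than word overlap; the point of the sketch is only that the schema items enter the ranking score alongside the question text.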

Abstract

Text-to-SQL is the task of converting natural language queries into Structured Query Language (SQL), enabling users to interact with databases without knowing SQL. Large Language Models (LLMs) have demonstrated considerable potential for Text-to-SQL through retrieval-augmented generation and prompt engineering. However, these methods still struggle to understand complex database schemas, which leads to invalid SQL queries. To address this issue, this paper proposes a Schema-aware Retrieval-Augmented Generation (SchemaRAG) framework with three core components. First, a SchemaLinker is fine-tuned to align natural language with schema items through knowledge distillation from high-quality chain-of-thought data, and its reasoning capabilities are further refined through group relative policy optimization. Second, a schema-augmented retriever retrieves the most relevant examples by referencing the database schema, thereby enhancing the LLM's ability to understand and generate SQL syntax. Finally, SchemaRAG adopts a Pareto-optimal selection mechanism that identifies the final SQL query from a set of high-quality candidates, which enhances robustness. As a result, SchemaRAG can learn complex database schemas and align its output syntactically with the structure of SQL, thereby generating more valid SQL queries. Extensive experiments on five benchmark datasets are conducted across several mainstream LLMs. The results demonstrate that SchemaRAG significantly outperforms four state-of-the-art Text-to-SQL competitors. The source code, datasets, and appendix of this paper are available at https://github.com/chelsea2002/SchemaRAG.
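
The abstract does not say which objectives the Pareto-optimal selection uses or how a single query is chosen from the non-dominated set, so the following is only a sketch under stated assumptions. Each candidate carries a tuple of scores where higher is better (the objectives themselves are hypothetical), and the final pick among non-dominated candidates uses a simple score sum as an assumed tie-break.

```python
def dominates(a, b):
    """True if score vector a is >= b everywhere and > b somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return candidates whose score vectors no other candidate dominates.

    candidates: list of (sql, scores) pairs; higher scores are better.
    """
    return [
        (sql, s) for sql, s in candidates
        if not any(dominates(t, s) for _, t in candidates)
    ]


def select_sql(candidates):
    """Pick one query from the front; the sum tie-break is an assumption."""
    if not candidates:
        raise ValueError("need at least one candidate query")
    front = pareto_front(candidates)
    return max(front, key=lambda c: sum(c[1]))[0]
```

Filtering to the Pareto front first means a candidate that is worse on every objective can never win, whatever rule is then used to choose among the remaining candidates.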

Authors

Di Wu

Zetong Tang

Yi He

Journals

Proceedings of the ACM on Management of Data

Institutions

William & Mary

Southwest University
