QueryCraft: Context Retrieval Techniques for NL2SQL

Step 2: Exploring the Depths: Techniques, Challenges, and Breakthroughs in Context Retrieval for Text-to-SQL pipeline

Published in

Towards Generative AI

5 min readJun 10, 2024

Understanding Text-to-SQL

Text-to-SQL is a groundbreaking technology that connects human language and SQL queries is called text-to-SQL. While writing SQL queries may appear difficult for non-technical people, Text-to-SQL makes it simple for them to use natural language to connect with databases. Database querying is now easier and more accessible than ever because of to this innovative method. Users don’t need to be skilled in SQL. they may just input their queries in plain language, and the model will handle the translation into SQL.

Fine-Tuning for Text-to-SQL

Fine-tuning is very important for making text-to-SQL models work their best. We give them examples and context to help them learn and make accurate SQL queries. Sometimes, though, the model might not be perfect for user database right out of the box. So fine-tuning comes in handy. We customize it to match the user’s database.

What is the role of Context Retrieval

Context retrieval is a fundamental aspect of the fine-tuning process. It involves extracting relevant information from a database schema to enrich the training data for the text-to-SQL model. This context typically includes the schema definition (DDL) of tables, which serves as the foundation for generating SQL queries.

This module offers a convenient solution for accessing context information like DLL schema definition from both SQLite and DB2 databases. Configurable through a user-friendly setup process, it efficiently retrieves context, essential for fine-tuning models for text-to-SQL tasks.

The workflow above displays the input data, including questions, queries, and database_IDs (db_ids). This information is crucial for fine-tune the foundation model. A question represents a user query in natural language text, a query is an SQL query corresponding to the user's query, and the db_id denotes the database name (schema name) for that specific query.

The context retriever processes each query sequentially to generate context. It extracts the table name from the query and retrieves the DDL schema for that particular table. This DDL schema serves as context for the model to comprehend the table structures and generate SQL queries. This process is repeated for each query in the input file, whether it’s sqlite or db2.

Now let's go through the process of configuring and executing the context retriever service, focusing on both SQLite and DB2 data sources.

Context Retriever in QueryCraft

Pre-req steps to get started

Access to a DB2 database. ( If the context is retrieved from an DB2 database, get the db2 credentials from IBM cloud: Link)
SQLite database ( If the context is retrieved from an SQLite database)
Input data with Questions, Query and tables. Link

If you need assistance to loading the data into DB2 or establishing a connection with it, please refer to our documentation for data ingestion. Link

Step-by-Step Explanation of the Context Retriever Process

Open the superConfig.ini file and specify the following parameters in the ContextRetriever section:

input_database_folder: Relative path to the folder containing SQLite database dumps, only required for SQLite.
input_data_file: Relative path to the CSV file with columns for questions, queries, and database IDs.
db_type: Specify the db_type as sqlite or db2 to indicate the database as data source.

SQLite Context Retrieval:

Let’s begin with SQLite context retrieval. User needs two things. a CSV input file with questions, queries, and database_ids, plus a folder holding your SQLite database files. Once user run the retriever, it will give output CSV file back. This output file includes extra details, like the DLL schema as context for tables.

If you do not have a dataset of your own, we provide either of the two datasets provided, Spider dataset and KaggleDBQA dataset (Kindly unzip the given files).

To use Spider dataset configure into ContextRetriever section of SuperConfig.ini as:

input_database_folder = input/spider/database/
input_data_file = input/datasets/spider.csv
db_type = sqlite

To use KaggleDBQA dataset configure into ContextRetriever section of SuperConfig.ini as:

input_database_folder =input/KaggleDBQA/database/
input_data_file = input/datasets/kaggleDBQA.csv
db_type = sqlite

From the example above, you can specify either Spider or KaggleDBQA in the superConfig.ini file.

Context Retriever from db2:

Unlike SQLite, you don’t need to give it a database file. It grabs the DLL schema straight from your DB2 tables. Instead, you need to upload the Golden query dataset (input_data_file) for Db2. This Golden query dataset should be uploaded to the input/datasets/ folder. Here is a sample input_data_file named DB2_Context_Input.csv, and ensure that db_type is mentioned as db2.

input_database_folder =   None
input_data_file = input/datasets/DB2_Context_Input.csv 
db_type = db2

Now let’s import the context retriever module and configuration file.

# Import the context_retriever module
import context_retriever as cr

# Importing configparser module
import configparser

# Reading configuration file
config_filePath = "Config.ini"
config = configparser.ConfigParser()
config.read(config_filePath)
config.sections()

Now configure the parameters, these include the input_database_folder, input_data_file, and db_type. These parameters dictate where the context retriever will look for your database files and input data, and what type of database it will be retrieving context from.

input_database_folder = config['ContextRetriever']['input_database_folder']
input_data_file = config['ContextRetriever']['input_data_file']
db_type = config['ContextRetriever']['db_type']

Once you’ve configured the parameters, it’s time to run the context retriever. Depending on your database type, you’ll use different configurations.

For SQLite, you’ll call the funcContextRetriever function with db_type set to "sqlite" and provide the path to your input data file. Here's how you do it:

cr.funcContextRetriever(db_type = "sqlite", input_file = "input/datasets/spider.csv")

For DB2, it’s a similar process. Set db_type to "db2" and provide the path to your DB2 input data file:

cr.funcContextRetriever(db_type = "db2", input_file = "input/datasets/DB2_Context_Input.csv")

Once the context retriever has done its job, you’ll find the retrieved context details in an output CSV file. This file contains valuable information that will enhance the understanding and performance of your text-to-SQL pipeline.

Conclusion

Context retrieval techniques play a crucial role in enhancing the effectiveness of Text-to-SQL models throughout the entire inference process. By leveraging context retrieval, these models can extract crucial details from a database schema, this helps to significantly improving the performance of Text-to-SQL pipelines. This enhancement simplifies the task of querying databases from natural language inputs, making the process more accurate and user-friendly. Integrating context retriever services for SQLite and DB2 databases not only facilitates easier fine-tuning and inference but also ensures smoother and more precise SQL query generation from natural language.