QueryCraft: How to Ingest your own Data for NL2SQL

Step 1. Data Ingestion in QueryCraft to Streamline SQL Generation

Shivam Solanki
Towards Generative AI
5 min read · Jun 10, 2024


Welcome to a deeper dive into the Data Ingestion step of the QueryCraft pipeline, an essential component of the SuperKnowa-QueryCraft framework. QueryCraft offers an innovative approach to adapting large language models (LLMs) for the generation of SQL queries from natural language, simplifying complex SQL scripting for developers and business analysts.

Why Data Ingestion Matters

Data ingestion is the first critical step in setting up a robust Text2SQL pipeline. It involves importing your data into a suitable environment where the LLM can interact with it. This process not only feeds the necessary data into the system but also sets the stage for all subsequent operations, from context retrieval to query generation and corrections.

How QueryCraft Handles Data Ingestion

QueryCraft provides flexible options for data ingestion, accommodating

  1. delimited file formats and
  2. SQLite databases.

This versatility ensures that users can integrate QueryCraft seamlessly into any of their existing tech stacks without the need for extensive modifications.

Configuring Data Ingestion in QueryCraft

I. Delimited File Ingestion:
For those utilizing cloud databases like Db2 on IBM Cloud, QueryCraft simplifies the ingestion of delimited files. Users can directly upload their data files, which are then ingested into the database to be utilized throughout the QueryCraft pipeline. This approach not only ensures that all data necessary for generating and refining SQL queries is readily available but also streamlines the setup process for subsequent modules.

The overall input and output of this service are shown below:

Input and Output of Data Ingestion service.

Here are the steps you can follow for data ingestion into Db2 on IBM Cloud:

Prerequisites (first-time activity):

The db2_Ingestion module offers a streamlined method for inserting data from CSV (or other delimited) files into Db2 to fine-tune NL2SQL pipelines.

1. First, set the following credentials in the expertConfig.ini file under the Db2_credentials section:

  • dsn_database: Name of the database.
  • dsn_uid: User ID for the database.
  • dsn_pwd: Password for the database.
  • dsn_hostname: Hostname or IP address of the database server.
  • dsn_port: Port number of the database server.
  • dsn_protocol: Protocol used for communication.
  • dsn_driver: Driver used for database connection.
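
For reference, the resulting section might look like the following (an illustrative sketch with placeholder values; substitute your own Db2 service credentials):

[Db2_credentials]
dsn_database = BLUDB
dsn_uid = db2_user
dsn_pwd = your_password
dsn_hostname = your-db2-host.databases.appdomain.cloud
dsn_port = 50000
dsn_protocol = TCPIP
dsn_driver = {IBM DB2 ODBC DRIVER}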

2. If you don’t have delimited files for your own database, you can use a file from the input/datasets folder, which also contains a golden query dataset.

Provided example datasets in QueryCraft

3. Now specify the file path, including the file name, in the simpleConfig file under the DataIngestion section. Additionally, indicate the table name that needs to be created in the Db2 database. If you are using salary.csv, TheHistoryofBaseball is the right schema. Ensuring the right schema is important because the golden query dataset stores this information in the db_id column, which is required to run the context retriever and the execution evaluation service.

#Relative path (from home_dir) of csv file to be ingested in db2 table
#CSV file for Loading
#filename = ../input/datasets/people.csv
filename = input/datasets/salary.csv
#Schema name - Database
schema_name = TheHistoryofBaseball
# Table name for CSV data
table_name = querycraft_db2_test

To ingest files that use a delimiter other than a comma, adjust the delimiter setting in the expertConfig file:

delimiter=,
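
For example, a pipe-delimited file would use (an illustrative value; any single-character delimiter can be set the same way):

delimiter=|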

Usage:

After setting up the simpleConfig file, run the Data Ingestion module of the QueryCraft pipeline using the runQueryCraft.sh file with the dataIngestion option to insert salary.csv into the querycraft_db2_test table in Db2:

sh runQueryCraft.sh

Enter the name of the component you want to run:

dataIngestion

You can validate the successful execution of the dataIngestion module from the Db2 UI as well.

Db2 UI validation for data ingestion
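
As an alternative check, a quick row count from any SQL interface confirms the load. The schema and table names below come from the example configuration above (note that Db2 may fold unquoted identifiers to uppercase, so you may need to quote them):

SELECT COUNT(*) FROM TheHistoryofBaseball.querycraft_db2_test;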

II. SQLite Ingestion:
Alternatively, users can opt for SQLite ingestion if they prefer a more lightweight or localized database solution. This method supports uploading a database dump in `.sqlite` format, making it ideal for portable or standalone applications of the QueryCraft pipeline.
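
As a rough illustration of how such a `.sqlite` dump can be produced from the same example CSV, here is a minimal sketch using Python's standard library (the file paths and table name mirror the Db2 example above and are otherwise assumptions, not part of QueryCraft itself):

import csv
import sqlite3

# Build a standalone SQLite database from the example CSV.
# Paths and names mirror the Db2 example; adjust for your data.
conn = sqlite3.connect("input/datasets/TheHistoryofBaseball.sqlite")

with open("input/datasets/salary.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS salary ({cols})")
    conn.executemany(f"INSERT INTO salary VALUES ({placeholders})", reader)

conn.commit()
conn.close()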

Setting the Stage for Success

Once data is ingested, QueryCraft utilizes this foundation to perform a range of tasks:

  • Context Retrieval: Extracting relevant schema and metadata to understand the database structure better.
  • Fine-Tuning: Adjusting the LLM’s parameters to optimize SQL query generation based on the specific characteristics of the ingested data.
  • Inference and Correction: Generating SQL queries from natural language and refining them to ensure accuracy and efficiency.

The Impact of Efficient Data Ingestion

Effective data ingestion not only enhances the performance of the SQL generation process but also significantly reduces the time and effort required to prepare data for complex queries. By automating and simplifying this initial step, QueryCraft allows users to focus more on strategic tasks such as data analysis and decision-making rather than on the technical nuances of data preparation.

Conclusion

In the realm of database management and query generation, the ease and efficiency of setting up an NL2SQL pipeline are paramount. QueryCraft’s approach to data ingestion exemplifies its commitment to making advanced AI tools accessible and practical for a broad audience. Whether you are a seasoned database professional or a business analyst seeking to leverage the power of LLMs, QueryCraft provides a user-friendly and powerful toolset to transform natural language queries into precise SQL commands.

For those interested in exploring further, QueryCraft's full capabilities extend beyond data ingestion to offer a comprehensive suite of tools designed to enhance every aspect of SQL query generation and optimization.

Explore more about QueryCraft and stay tuned for our next post on the intricacies of Context Retrieval in the QueryCraft pipeline.

Follow Towards Generative AI for more on the latest advancements in AI.

