QueryCraft: The Swiftest and Most Comprehensive Open Framework to Build NL2SQL at Scale

Enterprise-grade NL2SQL is hard and complex. QueryCraft offers an end-to-end open-source framework to generate accurate Text2SQL queries, with out-of-the-box tooling for annotation, fine-tuning, query correction, inference, evaluation, and an analytics dashboard.

Kunal Sawarkar
Towards Generative AI


QueryCraft’s GitHub Repo.

The hottest programming language of the moment is English, thanks to LLMs that can generate code in any programming language from natural language. While code generation in any language is valuable for enterprise AI, a particular focus has been on creating SQL from plain English, since it is a transformative use case that bridges the gap between business and IT.

Democratizing Data

Before GenAI, conducting ‘what if’ analysis meant navigating through Data Analytics or BI teams, resulting in delays and significant costs. Data in many enterprises remains locked within thousands of RDBMS tables, secured by stringent access controls and teams of data analysts. The primary bottleneck is the need to understand complex data models and convert business requirements into SQL queries. Nearly half of an enterprise’s analytics budget is spent just on that conversion.

Imagine a world where the need for a dedicated data analyst team is eliminated and business analysts can answer their own questions. With NL2SQL, query-writing expertise is no longer a prerequisite, empowering domain experts and enabling the most significant democratisation of data since the invention of the RDBMS four decades ago.

Why Is Enterprise NL2SQL Hard?

It can take you a few hours to hook up a sample CSV file to an LLM and build a PoC, but taking it to production is a different ballgame. The main reason is that for this task you often have to fine-tune your model to get better results. Fine-tuning can make even a small 8B model outperform much larger ones, as shown below.

To fine-tune a model on your dataset, typical prompt engineering isn't enough. You need to cover the whole nine yards (quite literally, as shown below!)

  1. Annotation and Training Data: The biggest challenge is that even if you build an MVP for NL2SQL, how will you know it is good enough? For that you need to curate a golden dataset specific to your domain for evaluation, as well as an instruct set for fine-tuning.
  • Creating Golden Datasets: Developing high-quality annotated datasets is labor-intensive. They need to cover a wide range of possible complex queries and their corresponding SQL translations.
  • Annotation Quality: Ensuring consistent and accurate annotations is difficult but crucial for effective model training. You need to employ reliability metrics like IRR (Inter-Rater Reliability) for it to be trustworthy.

2. Complex Metadata Structures:

  • Data Models: Understanding intricate data models and mapping natural language queries to the appropriate tables and columns is difficult.
  • Cryptic Column Names: Column names in large databases can be cryptic and non-descriptive (like SRT_712), making it hard for AI to accurately interpret the intent behind a user query.

3. Volume of Data: Enterprises typically have vast amounts of data spread across hundreds of tables. This scale adds complexity in terms of ensuring fast query generation.

4. High Accuracy Requirements: Semantic Understanding: accurately interpreting the semantics of natural language queries and translating them into precise, executable SQL queries is a non-trivial task. Unlike English, each database has its own SQL grammar.

5. Fine-Tuning:

  • Complexity: Fine-tuning models requires understanding specific hyperparameter settings such as r values, GPU memory optimisation, and error rates across various methods, which is challenging even for data scientists.

6. Scalability and Inference Speed: Generating SQL queries efficiently at scale is essential yet complex.

7. Evaluation Framework: Establishing reliable metrics for evaluating the performance of NL2SQL is non-trivial. It involves assessing both syntactic accuracy (correct SQL syntax) and semantic accuracy (correct interpretation of user intent).

8. Continuous Improvement: Continuously improving the NL2SQL system based on feedback and performance metrics requires an ongoing dashboard to monitor and course-correct.

Overall, scaling NL2SQL applications in an enterprise demands sophisticated algorithms, robust infrastructure, and continuous optimization.

What if a single framework can handle all of the above?

Introducing QueryCraft

To tackle these challenges, we created the SuperKnowa QueryCraft framework, designed for AI engineers to rapidly build NL2SQL solutions. QueryCraft is one of the first high-abstraction open-source frameworks dedicated to creating comprehensive end-to-end NL2SQL pipelines. It makes life easier by allowing users to glue the pipeline stages above together like Lego pieces.

While currently optimized for SQL databases, it also supports graph queries like Neo4j’s Cypher. It comes with out-of-the-box capabilities such as:

  • Ready-to-use fine-tuning steps with default settings optimized for these challenges, with the flexibility for experts to define their own LoRA or Q-LORA settings.
  • A post-processing algorithm that automatically corrects generated SQL queries, significantly boosting accuracy.
  • Multiple inference styles: batch, vLLM, multi-threaded, or Ray.
  • Automated visualization of experiment results in a dashboard, sortable by SQL query type and complexity.
Fig- QueryCraft Modules

QueryCraft has 9 modules that can be put together to build your pipeline:

1. Annotation of NL2SQL: SMEs can annotate business/use-case-specific NL2SQL pairs to be used for evaluation and fine-tuning (aka the Golden Set).

2. Database: An engine to execute SQL queries. Prebuilt connectors for SQLite, Db2, and CSV conversion.

3. Context Retriever: Extracts the context given a CSV with the question, golden query, and DB ID, and creates a dataset for fine-tuning.

4. Fine-Tuning: Either use a pre-trained LLM or fine-tune one using a PEFT technique (LoRA or QLoRA) on your tuning dataset.

5. Inference: Module to run inference in batch, distributed, or vLLM mode.

6. Query Correction: Module to fix issues in the generated SQL.

7. Evaluation: Measures execution accuracy to evaluate the model's efficacy.

8. Analytics Dashboard: Tool to analyze performance by slicing and dicing queries.

9. UI/UX: The end-user interface to consume the pipeline, such as ChatSQL.

Pick any LLM or Inference Framework

Choose any of the popular options with any underlying LLM to construct NL2SQL applications using a single config file. QueryCraft is a high-abstraction framework that makes your NL2SQL pipeline work easier without locking you into any architectural choices. QueryCraft includes presets for CodeLlama, Llama, and IBM Code Granite models (which surpass leaderboard benchmarks on NL2SQL tasks), but it can work with any model on HuggingFace.

With QueryCraft, we achieved nearly 90% accuracy on private, complex, real-life datasets and scaled to generating millions of queries.

Getting Started with QueryCraft

Get going by starring the QueryCraft GitHub Repo. You will need a dataset and can use any of these three options:

  • Bring your own dataset in CSV, SQLite, or DB2 format, or convert it to CSV
  • Use the annotation tool to curate your dataset
  • Use an example database provided for testing (like Spider)

Then all you need to do is make your feature selections inside a single config file.

The config file comes with two modes, Easy and Expert!

Easy Mode

  • simpleConfig: Use this to specify the paths for datasets, models, and other services as per your setup. This is a great option for AI engineers who want a quick baseline, using the best preset hyperparameters for fine-tuning without worrying too much.

Expert Mode

  • expertConfig: Use this to configure the parameters of your experiment, including fine-tuning. You can define your own values for r and other memory settings; this is the recommended option for data scientists.
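
To make this concrete, here is a minimal Python sketch of how such a layered config could be read. The section and key names below are hypothetical placeholders for illustration, not QueryCraft's actual config schema.

```python
# Illustrative only: section and key names are assumptions, not QueryCraft's actual schema.
from configparser import ConfigParser

config = ConfigParser()
config.read(["simpleConfig.ini", "expertConfig.ini"])  # expert values override easy-mode ones

# Easy mode: paths to data, model, and services
dataset_path = config.get("paths", "dataset", fallback="data/train_questions.csv")
base_model = config.get("paths", "model", fallback="codellama/CodeLlama-7b-hf")

# Expert mode: fine-tuning hyperparameters (preset defaults used when absent)
lora_r = config.getint("finetune", "lora_r", fallback=16)
lora_dropout = config.getfloat("finetune", "lora_dropout", fallback=0.05)

print(f"Fine-tuning {base_model} on {dataset_path} with r={lora_r}, dropout={lora_dropout}")
```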

Using QueryCraft for Building NL2SQL Applications

Here is how you can build your pipeline:

Step Zero- Annotation Tool

If you don't already have one, create your own eval set and instruct set with our annotation tool. A detailed blog on the tool is available.

The tool is designed to scale to multiple raters and can calculate IRR (Inter-Rater Reliability) automatically using the Fleiss' kappa agreement score, giving you a trustworthy golden set.
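
As an illustration of the metric, the sketch below computes Fleiss' kappa over several raters' verdicts on annotated NL2SQL pairs using statsmodels. The data and the use of statsmodels are my own choices for the example; the annotation tool's internal wiring may differ.

```python
# Sketch: Fleiss' kappa over rater verdicts. Data is made up for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = annotated NL2SQL pairs, columns = raters,
# values = categorical labels (0 = incorrect SQL, 1 = correct SQL).
ratings = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
])

table, _ = aggregate_raters(ratings)      # per-item counts of each category
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.2f}")      # higher values indicate stronger agreement
```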

Step 1- Data Ingestion

For those utilizing cloud databases like Db2 on IBM Cloud, QueryCraft simplifies the ingestion of delimited files. Users can directly upload their data files, which are then ingested into the database and used throughout the QueryCraft pipeline. You are welcome to use any DB of your choice as well by changing the underlying configuration.

Data Ingestion Module

Detailed blog
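
For a feel of what this step does, here is a minimal sketch that loads a delimited file into SQLite with pandas. The file and table names are placeholders; the module's actual Db2 and SQLite connectors differ in detail.

```python
# Sketch: load a delimited file into SQLite so the rest of the pipeline can execute SQL against it.
# File and table names are illustrative, not fixed by QueryCraft.
import sqlite3
import pandas as pd

df = pd.read_csv("sales_orders.csv")              # any delimited file
conn = sqlite3.connect("querycraft_demo.db")
df.to_sql("sales_orders", conn, if_exists="replace", index=False)

# Quick sanity check that the data is queryable
print(pd.read_sql_query("SELECT COUNT(*) AS n FROM sales_orders", conn))
conn.close()
```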

Step 2- Define your Context Retriever

Next, build your context retriever to translate NL to SQL.

The workflow above shows how user intent is understood and the query is created. This information is crucial for fine-tuning the LLM on your own custom set in the next step.
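
As a rough sketch of what a context retriever produces, the snippet below pulls table schemas from SQLite and packs them with the question into a prompt. The prompt template and function names are illustrative assumptions, not QueryCraft's exact format, which is built from the CSV of questions, golden queries, and DB IDs.

```python
# Sketch: build the schema context for a question so the LLM sees the tables it must query.
# Prompt template and names are illustrative, not QueryCraft's exact format.
import sqlite3

def get_schema_context(db_path: str) -> str:
    """Return the CREATE TABLE statements of every table in the database."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND sql IS NOT NULL"
    ).fetchall()
    conn.close()
    return "\n".join(sql for (sql,) in rows)

def build_prompt(question: str, db_path: str) -> str:
    schema = get_schema_context(db_path)
    return (
        f"### Database schema:\n{schema}\n\n"
        f"### Question:\n{question}\n\n"
        f"### SQL:\n"
    )

print(build_prompt("How many orders were placed last month?", "querycraft_demo.db"))
```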

Step 3 — Fine-Tune for your SQL Dataset

Now comes the crux of the matter: fine-tuning the model on your data. Scale up the GPU and find the right parameters. To make life easy, we have already run hundreds of experiments so that you can directly use the best hyperparameter settings for NL2SQL with LoRA and QLoRA. And if you feel like going deeper, just modify the config file and set your own values for r, dropout, etc.
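
The sketch below shows the general shape of a LoRA fine-tune with Hugging Face PEFT. The hyperparameter values and target modules are illustrative placeholders, not QueryCraft's tuned presets, which live in the config file.

```python
# Sketch: wrap a base model with LoRA adapters via PEFT.
# Values are placeholders, not QueryCraft's preset hyperparameters.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "codellama/CodeLlama-7b-hf"   # any causal LM from HuggingFace works
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,                          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only a small fraction of weights are trained
# ...then train on the (schema + question -> SQL) instruct set with your preferred trainer.
```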

Step 4— Inference Service

Run inference in batch or sequential mode, or in distributed mode using Ray.
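
For example, batch generation with vLLM looks roughly like the sketch below; the model name, prompt, and sampling settings are illustrative and would normally come from the config and the context retriever step.

```python
# Sketch: batch SQL generation with vLLM. Model name and sampling settings are illustrative.
from vllm import LLM, SamplingParams

# Prompts would normally come from the context retriever step.
prompts = [
    "### Database schema:\nCREATE TABLE orders(id INT, total REAL, placed_on DATE)\n\n"
    "### Question:\nHow many orders were placed last month?\n\n### SQL:\n",
]

llm = LLM(model="codellama/CodeLlama-7b-hf")
params = SamplingParams(temperature=0.0, max_tokens=256)   # greedy decoding suits SQL

for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```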

Step 5 — SQL Query Corrector

One of the challenges with LLM-generated SQL queries is that they can be syntactically incorrect or have minor inconsistencies. To fix this, the QueryCraft pipeline has a Query Corrector service that fixes the SQL for the most common cases. We have seen a 10–20% accuracy boost from this step.
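
A minimal sketch of this kind of post-processing is shown below: stripping markdown fences and re-emitting the statement through sqlglot to catch syntax errors and normalize the dialect. The use of sqlglot is my own choice for the example; QueryCraft's actual correction rules are more extensive and may work differently.

```python
# Sketch: simple post-processing of LLM output. QueryCraft's corrector covers more cases.
import re
import sqlglot

def correct_sql(raw: str, dialect: str = "sqlite") -> str:
    # 1. Strip markdown fences the model may wrap around the query.
    sql = re.sub(r"```(sql)?", "", raw).strip()
    sql = sql.split(";")[0] + ";"           # keep only the first statement

    # 2. Parse and re-emit through sqlglot to catch syntax errors and normalize the dialect.
    try:
        sql = sqlglot.transpile(sql, read=dialect, write=dialect)[0]
    except sqlglot.errors.ParseError:
        pass                                 # leave it unchanged; evaluation will flag it
    return sql

print(correct_sql("```sql\nSELECT COUNT(*) FROM orders\n```"))
```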

Step 6- Query Evaluation Framework

The next step is to see how accurate the generated queries are. While there are many ways to measure similarity, we found that they have little practical significance: a correct query can be written in multiple ways. What matters is whether the query correctly understood the business context and produced the expected output. So we built a module to do black-box testing by calculating execution accuracy.

Detailed steps here
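
The idea, roughly: run both the golden and generated queries against the same database and compare their result sets. Below is a sketch; comparing results as sorted row sets is one reasonable choice, and the framework's exact matching rules may differ.

```python
# Sketch: execution accuracy by comparing result sets of golden vs. generated SQL.
import sqlite3

def execution_match(db_path: str, golden_sql: str, generated_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        golden = conn.execute(golden_sql).fetchall()
        generated = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False                         # generated query failed to execute
    finally:
        conn.close()
    return sorted(golden) == sorted(generated)   # order-insensitive comparison

pairs = [("SELECT COUNT(*) FROM orders;", "SELECT COUNT(id) FROM orders;")]
accuracy = sum(execution_match("querycraft_demo.db", g, p) for g, p in pairs) / len(pairs)
print(f"Execution accuracy: {accuracy:.0%}")
```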

Step 7 — Accuracy Analytics Dashboard

It's one thing to get overall accuracy of 80%, but that number may hide dirty secrets. It is trivial for an LLM to generate SELECT * style queries, but it struggles with complex analytical query generation such as joins with aggregates or ranking functions. To drill down, you obviously need a tool that can help you with it.
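
Under the hood this kind of drill-down is just grouping evaluation results by query category. A hypothetical sketch with pandas (the category labels and data are illustrative, not the dashboard's actual taxonomy):

```python
# Sketch: per-category accuracy, the kind of slicing the dashboard visualizes.
import pandas as pd

results = pd.DataFrame({
    "category": ["simple_select", "join", "aggregate", "join", "ranking"],
    "executed_correctly": [True, False, True, True, False],
})

by_category = (
    results.groupby("category")["executed_correctly"]
    .agg(accuracy="mean", queries="count")
    .sort_values("accuracy")
)
print(by_category)   # reveals where the model struggles, e.g. joins or ranking functions
```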

QueryCraft Open Stack

You must be wondering what's inside QueryCraft. It's an open technical stack built on top of Red Hat and can be deployed anywhere, on-premises or in the cloud.

With QueryCraft, we don’t just craft queries; we craft excellence at scale. Our evaluation framework is engineered to push the boundaries of Text2SQL, ensuring that every generated SQL query is syntactically precise, semantically accurate, and operationally effective.

Star QueryCraft’s GitHub repo for the source code and share your feedback, either via comments or by raising an issue in the repo.

Follow Towards Generative AI for more on the latest advancements in AI.
