Evaluation Framework for NL2SQL Generation: QueryCraft

Step 6. Elevating NL2SQL with QueryCraft’s Evaluation Framework

Shivam Solanki
Towards Generative AI
6 min read · Jun 10, 2024


Highlighted Eval framework of the QueryCraft pipeline

Accurate evaluation is just as crucial as the initial model training when refining the capabilities of large language models (LLMs) for NL2SQL tasks. We understand this need and have crafted an innovative evaluation framework in QueryCraft to rigorously assess and refine our NL2SQL pipeline. Our framework consists of three pivotal components: Query Correction, Execution Evaluation, and the Query Analysis Dashboard.

1. Query Correction: Polishing SQL to Perfection

The path to perfect SQL generation is fraught with syntactic snares. Our Query Correction service is the polish that smooths out these rough edges. By rectifying syntax-level inaccuracies that the LLM leaves behind, we enhance each query’s suitability for the subsequent evaluation phases, ensuring the SQL output is ready to face real-world databases.

First, use your LLM to run inference (i.e., generate SQL queries) on your golden dataset, which contains natural language and SQL pairs. This inference output serves as the input to the Query Correction service (as shown in the image below).

In our case, this is a CSV file with the columns question (natural language question), context (database schema), query (SQL query), and model_op (Model output or generated query). The service works its magic, tweaking and tuning the SQL until it’s structurally sound. Please read our detailed blog on the Query Correction service to learn how it works.

Query Correction Service

2. Execution Evaluation: The Litmus Test of SQL Usability

Execution Evaluator

Accuracy in SQL generation is meaningless without executable queries that yield the correct results. Our Execution Evaluation service steps in as the litmus test, analyzing the NL2SQL results both for the raw inference and for the refined output after query correction.

This service computes execution accuracy by running the generated queries against the database and comparing the results with those generated by the golden queries. The metrics include exact matches, partial matches, and an insightful breakdown of errors, giving ML engineers the feedback necessary to iterate and improve.
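
To make this concrete, here is a minimal sketch of how such an execution-accuracy metric can be computed against a SQLite database. The function and parameter names are illustrative rather than QueryCraft’s exact API, and the naive result comparison used here is refined by the helper functions described in the next section.

```python
import sqlite3

def execution_accuracy(pairs, db_path):
    """Share of generated queries whose results match the golden query's.

    `pairs` holds (golden_sql, generated_sql) tuples and `db_path` points to
    a SQLite database; both names are illustrative, not QueryCraft's API.
    """
    conn = sqlite3.connect(db_path)
    matches = 0
    for golden_sql, generated_sql in pairs:
        try:
            golden = conn.execute(golden_sql).fetchall()
            generated = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            continue  # a query that fails to execute counts as a miss
        # Naive order-insensitive comparison; see the helpers below for
        # the more careful equivalence logic.
        if sorted(map(repr, golden)) == sorted(map(repr, generated)):
            matches += 1
    conn.close()
    return matches / len(pairs) if pairs else 0.0
```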

A Peek Into the Execution Evaluator’s Mechanics

The Execution Evaluator service of QueryCraft takes two inputs:

  • Golden query dataset, which contains the natural language question and SQL query pairs
  • Database against which the service will run the queries

Once you provide these two inputs, the service can run the queries against your database and generate the following outputs:

Input-Output of Execution evaluator service

Inference Accuracy: Measures the accuracy of the generated queries before query correction.
Post-Processing Accuracy: Measures the accuracy after the Query Correction service has refined the generated SQL.

Working principle of our Execution Evaluator Module:

Now that we understand the inputs and outputs of the Execution Evaluator module, let’s dive deep into the logic behind the scenes to calculate execution accuracy.

Error Handling:
A fundamental part of evaluation is error handling. The error_handling function catches common SQL errors such as missing columns, syntax mistakes, and ambiguous names, categorizing them into a human-readable form. This function enhances the evaluator's robustness by gracefully capturing and reporting SQL execution issues.
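
As an illustration, here is a minimal sketch of how such categorization can work against SQLite; the real service’s signature and error categories may differ.

```python
import sqlite3

def error_handling(cursor, sql):
    """Execute `sql` and map common failures to readable categories (sketch)."""
    try:
        cursor.execute(sql)
        return "success", None
    except sqlite3.Error as exc:
        message = str(exc)
        # Bucket the raw driver message into a human-readable category.
        if "no such column" in message:
            return "missing column", message
        if "syntax error" in message:
            return "syntax error", message
        if "ambiguous column name" in message:
            return "ambiguous name", message
        return "other error", message
```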

Query Reformatting:
The reformat_query function addresses a specific normalization need: it replaces table-specific wildcards like `t1.*` with a generic `*`. This normalization is necessary because different SQL generators may follow different conventions for writing queries, and the evaluator must compare like with like.
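
A minimal sketch of this normalization, assuming aliases take the form of an identifier followed by `.*`:

```python
import re

def reformat_query(query: str) -> str:
    """Replace table-qualified wildcards such as `t1.*` with a bare `*`."""
    return re.sub(r"\b\w+\.\*", "*", query)

print(reformat_query("SELECT t1.* FROM singer AS t1"))
# SELECT * FROM singer AS t1
```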

SQL Validation:
Before comparing results, we must ensure the query is executable. The isValidSQL function attempts to run the SQL command. If it executes without error, the query is considered valid, signifying it has passed the first gate of evaluation.
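
A sketch of this validity gate for a SQLite database (the actual implementation may manage connections differently):

```python
import sqlite3

def isValidSQL(sql: str, db_path: str) -> bool:
    """Return True if `sql` executes without error against the database."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```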

Unordered Row Comparison:
The unorder_row function sorts the values within each row, converting the row into a normalized tuple. This is particularly useful for comparing query results where the order of columns is not fixed but their contents matter.
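
A sketch of the idea; keying the sort on both the string value and the type keeps rows with mixed types sortable:

```python
def unorder_row(row: tuple) -> tuple:
    """Sort the values inside a row so that column order no longer matters."""
    return tuple(sorted(row, key=lambda v: (str(v), str(type(v)))))

print(unorder_row(("Alice", 3)))
# (3, 'Alice') -- the same normalized tuple for any column order
```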

Quick Rejection:
The quick_rej method provides a swift pathway to reject non-matching results. It compares sorted versions of result rows, either respecting their order (for ordered queries) or as unordered sets (for queries where the order is irrelevant).
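
A sketch building on the unorder_row helper above:

```python
def quick_rej(result1, result2, order_matters: bool) -> bool:
    """Cheap first pass: True means the results may match, False rejects them."""
    s1 = [unorder_row(row) for row in result1]
    s2 = [unorder_row(row) for row in result2]
    if order_matters:
        return s1 == s2       # ordered queries: compare row sequences
    return set(s1) == set(s2)  # unordered queries: compare as sets
```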

Permutation Generation:
For more nuanced cases, get_constraint_permutation and permute_tuple engage in a sophisticated match-check by considering all possible permutations of the result sets, accommodating scenarios where the columns might be returned in a different order but still represent the same data.
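
The full constraint-based permutation logic prunes the search space; the sketch below shows only the basic reordering step it relies on:

```python
from itertools import permutations

def permute_tuple(element: tuple, perm: tuple) -> tuple:
    """Reorder one row according to a column permutation."""
    return tuple(element[i] for i in perm)

row = ("Alice", 30)
print([permute_tuple(row, p) for p in permutations(range(len(row)))])
# [('Alice', 30), (30, 'Alice')]
```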

Multiset Equality:
multiset_eq checks whether two lists contain the same elements, including the same number of duplicates, which is crucial for SQL queries where the result may include repeating rows.
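
In Python, this check can be expressed compactly with collections.Counter:

```python
from collections import Counter

def multiset_eq(l1: list, l2: list) -> bool:
    """True when both lists contain the same elements with the same counts."""
    return Counter(l1) == Counter(l2)

print(multiset_eq([(1,), (1,), (2,)], [(2,), (1,), (1,)]))  # True
print(multiset_eq([(1,), (2,)], [(1,), (1,)]))              # False
```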

Comprehensive Result Comparison:
Finally, the result_eq function is the centerpiece that orchestrates the comparison process, employing all the aforementioned helpers to ascertain whether two result sets are equivalent, accounting for both the presence and order of rows and columns as necessary.
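
A condensed sketch that stitches together the helpers sketched above; the production evaluator prunes the permutation search more aggressively:

```python
from itertools import permutations

def result_eq(result1, result2, order_matters: bool) -> bool:
    """Decide whether two result sets are equivalent (sketch)."""
    if len(result1) != len(result2):
        return False
    if not result1:                 # both result sets are empty
        return True
    if len(result1[0]) != len(result2[0]):
        return False                # different column counts can never match
    if not quick_rej(result1, result2, order_matters):
        return False
    # Try every column ordering of result2 against result1.
    for perm in permutations(range(len(result1[0]))):
        permuted = [permute_tuple(row, perm) for row in result2]
        if order_matters:
            if result1 == permuted:
                return True
        elif multiset_eq(result1, permuted):
            return True
    return False
```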

Evaluation Calls:
- eval_exec_match_sqlite and eval_exec_match_db2 serve as wrappers to apply the evaluator logic to SQLite and IBM DB2 databases, respectively.
- query_processing cleans and reformats the queries and model outputs, preparing them for evaluation.
- formaterAndCaller_sqlite and formaterAndCaller_db2 are the higher-level functions that format the queries, call the evaluation functions, and collect the scores.
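
Putting it together, here is a sketch of what the SQLite wrapper might look like, reusing result_eq from above; QueryCraft’s actual signatures may differ.

```python
import sqlite3

def eval_exec_match_sqlite(db_path, golden_sql, generated_sql,
                           order_matters=False):
    """Run both queries on a SQLite database and score them with result_eq."""
    conn = sqlite3.connect(db_path)
    try:
        golden = conn.execute(golden_sql).fetchall()
        generated = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return 0  # an unexecutable query scores zero
    finally:
        conn.close()
    return int(result_eq(golden, generated, order_matters))
```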

The output of the Query Correction service serves as the input of the Execution Evaluator service (as shown in the image below). This input is a CSV file with the following columns: question (natural language question), context (database schema), query (SQL query), model_op (model output, i.e., the generated query), and optionally model_op1 for the query-corrected output.

Execution Evaluator

3. Query Analysis Dashboard: Insights at a Glance

Query Analysis Dashboard

Evaluation data is only as good as the insights it offers. Our Query Analysis Dashboard encapsulates this ideology by serving as a one-stop visualization tool for examining generated queries, categorizing inaccuracies, and benchmarking the results across multiple LLMs. It’s an invaluable resource for identifying areas for model improvement with features that include:

Query Type Analysis: Analyze where your model excels and where it stumbles.
Error Analysis: Pinpoint and categorize the exact nature of errors in query generation.
Result Analysis: Assess the overall success rate of query execution.
Execution Accuracy Metrics: Measure the precision of your model’s output before and after query correction.

The output of the Execution Evaluation service is fed into the Query Analysis Dashboard. This is a CSV file containing the evaluation results, with columns like question, context, query, model_op, and optionally model_op1, error_type, and result.

Query Analysis dashboard service

Please read our detailed blog on the Query Analysis Dashboard to learn more about it.

Conclusion

QueryCraft’s evaluation framework empowers developers to refine their NL2SQL pipelines rapidly. By combining Query Correction, Execution Evaluation, and the Query Analysis Dashboard, it provides a comprehensive suite of tools to identify and rectify errors, measure success, and gain valuable insights into model behavior.

This multifaceted approach ensures that generated SQL queries are not just syntactically correct but also functionally accurate and executable against real-world databases. The Query Analysis Dashboard’s clear visualization capabilities further accelerate the development process by allowing ML engineers to pinpoint areas for improvement and benchmark the performance of various large language models.

QueryCraft’s evaluation framework is a powerful asset for anyone developing or refining NL2SQL models. By leveraging its capabilities, you can ensure that your models generate not only syntactically correct SQL but also queries that deliver the desired results on real databases.

Ready to take your NL2SQL models to the next level? Explore QueryCraft’s evaluation framework today and unlock the full potential of your LLMs!

For a deeper dive into the specific functionalities of each evaluation component, please refer to our detailed blogs on Query Correction, Execution Evaluation, and the Query Analysis Dashboard.

Follow Towards Generative AI for more on the latest advancements in AI.
