Automating NL2SQL Execution Accuracy Measurements: QueryCraft

Step 6B. A Step dedicated to capturing evaluation of Execution Accuracy in NL2SQL

Shivam Solanki
4 min readJun 10, 2024

The frontier between natural language and database querying is rapidly advancing with the integration of Large Language Models (LLMs) into SQL generation tasks. Central to this integration is the robust evaluation of the generated queries. Enter the Execution Evaluator — a cornerstone component within the SuperKnowa QueryCraft, designed to meticulously assess the practical accuracy of SQL queries generated from text descriptions.

Highlighted Eval Framework of the QueryCraft pipeline

Introducing the Execution Evaluator

The Execution Evaluator is a sophisticated module poised to evaluate the precision of SQL queries crafted from natural language inputs within the NL2SQL pipeline. The module accomplishes this by analyzing the execution accuracy of queries against a database, providing a quantitative assessment that transcends mere syntactic correctness.

Why Execution Accuracy Matters

Execution accuracy transcends the confines of syntactic validation, venturing into the realm where queries are gauged based on their semantic integrity. This accuracy ensures the SQL queries, once executed, yield results that faithfully align with the intended user queries — essentially embodying the true essence of what was asked.

Let’s dissect the essential methods and understand their roles in evaluating execution accuracy.

A Peek Into the Execution Evaluator’s Mechanics

The following is the input of the Execution Evaluator service of QueryCraft:

  • Golden query dataset, which contains the natural language question and SQL query pairs
  • Database against which the service will run the queries

Once you provide these two essential information to run the Execution Evaluator service on your database, it can generate the following outputs:

Input-Output of Execution evaluator service

Inference Accuracy: An evaluation that measures the query’s accuracy pre-query correction.
Post-Processing Accuracy: An assessment that encapsulates the accuracy after the query correction process has refined the generated SQL.

Working principle of our Execution Evaluator Module:

Now that we understand the inputs and outputs of the Execution Evaluator module, let’s dive deep into the logic behind the scenes to calculate execution accuracy.

Error Handling:
A fundamental part of evaluation is error handling. The error_handling function catches common SQL errors such as missing columns, syntax mistakes, and ambiguous names, categorizing them into a human-readable form. This function enhances the evaluator's robustness by gracefully capturing and reporting SQL execution issues.

Query Reformatting:
The reformat_query function addresses a specific normalization need. It replaces table-specific wildcards liket1.* with a generic `*`. This normalization is necessary because different SQL generators may have varying conventions for representing queries, and the evaluator must compare like with like.

SQL Validation:
Before comparing results, we must ensure the query is executable. The isValidSQL function attempts to run the SQL command. If it executes without error, the query is considered valid, signifying it has passed the first gate of evaluation.

Unordered Row Comparison:
The unorder_row function sorts the elements within each row irrespective of their order, converting the row into a normalized tuple. This is particularly useful for comparing query results where the order of columns is not fixed, but their contents matter.

Quick Rejection:
The quick_rej method provides a swift pathway to reject non-matching results. It compares sorted versions of result rows, either respecting their order (for ordered queries) or as unordered sets (for queries where the order is irrelevant).

Permutation Generation:
For more nuanced cases, get_constraint_permutation and permute_tuple engage in a sophisticated match-check by considering all possible permutations of the result sets, accommodating scenarios where the columns might be returned in a different order but still represent the same data.

Multiset Equality:
multiset_eq checks whether two lists contain the same elements, including the same number of duplicates, which is crucial for SQL queries where the result may include repeating rows.

Comprehensive Result Comparison:
Finally, the result_eq function is the centerpiece that orchestrates the comparison process, employing all the aforementioned helpers to ascertain whether two result sets are equivalent, accounting for both the presence and order of rows and columns as necessary.

Evaluation Calls:
- eval_exec_match_sqlite and eval_exec_match_db2 serve as wrappers to apply the evaluator logic to SQLite and IBM DB2 databases, respectively.
- query_processing cleans and reformats the queries and model outputs, preparing them for evaluation.
- formaterAndCaller_sqlite and formaterAndCaller_db2 are the higher-level functions that format the queries, call the evaluation functions, and collect the scores.

Conclusion

The Execution Evaluator stands as a critical pillar within the NL2SQL pipeline, ensuring the usability and accuracy of the generated SQL queries. By incorporating comprehensive error handling, flexible result comparison, and extensive normalization, it offers a thorough examination of the model’s output.

The Execution Evaluator guarantees that the final output is not just syntactically correct but also contextually accurate, capable of interacting with databases to fetch the correct data, thus validating the practical application of the generated queries. Therefore, the Execution Evaluator forms the foundation of the Evaluation framework of our NL2SQL pipeline.

In essence, this evaluator transcends beyond a mere correctness check — it endows the NL2SQL pipeline with the reliability and robustness necessary for real-world applications, marking a significant milestone in the field of natural language to database querying.

The next step is to analyze the generated queries using our Query Analysis dashboard. Learn more about it in the next blog.

Follow Towards Generative AI for more on the latest advancements in AI.

--

--