QueryCraft: Exploring LLM Performance in NL2SQL generation with Query Analysis Dashboard

Step 7: Exploring different analyses of how LLMs translate natural language questions into structured database queries

Abhilasha Mangal
Towards Generative AI
5 min read · Jun 10, 2024


What is Text-To-SQL Evaluation?

In the context of Text-To-SQL (Structured Query Language), evaluation typically involves measuring how well a large language model can translate natural language questions into accurate and efficient SQL queries. It helps us understand how well a model interprets natural language in order to generate SQL queries.

Text To SQL evaluation dashboard (Image credit: author)

A Text2SQL evaluation process requires a CSV file that contains the columns ‘db_id’, ‘question’ (the natural language question), ‘query’ (the golden query), ‘model_op’, ‘model_op1’, ‘error’, and ‘result’. The ‘model_op’ column is generated by the inference process using a large language model, our Query correction service generates ‘model_op1’, and the evaluation process produces ‘error’ and ‘result’.
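As a quick illustration, here is a minimal sketch of loading and sanity-checking such a CSV with pandas (the file name `evaluation_input.csv` is hypothetical; the column names follow the post):

```python
import pandas as pd

# Hypothetical input file; the column names below follow the post.
df = pd.read_csv("evaluation_input.csv")

expected_cols = {"db_id", "question", "query", "model_op", "model_op1", "error", "result"}
missing = expected_cols - set(df.columns)
if missing:
    raise ValueError(f"Missing columns: {missing}")

# Quick look at the questions, golden queries, and model outputs.
print(df[["db_id", "question", "query", "model_op"]].head())
```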
To gain a deeper understanding of the evaluation metrics, we created a Streamlit app for query analysis. The dashboard provides functionality for exploring both the training dataset and the Text2SQL evaluation metrics.

Training Data Analysis:

Training data is a critical part of understanding any fine-tuning experiment. To understand the data better, we analyze it along three dimensions: context length distribution, difficulty level, and query classification.
Context length distribution shows how prompt and query lengths are spread across the dataset, which helps judge which model is suitable for training or inference with this dataset.
Difficulty level analysis shows how complex each query is. We divided queries into three categories, Simple, Moderate, and Challenging, based on the token count and certain special SQL keywords.
Query classification analysis shows the count of each query type. We divided queries into SELECT, WHERE, JOIN, GROUP BY, ORDER BY, and AGG/RATIO; a sketch of illustrative classification rules follows below.
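As an illustration, a minimal sketch of how such difficulty and query-type classification might be implemented (the token-count thresholds, keyword lists, and priority order here are assumptions, not the exact QueryCraft rules):

```python
import re

def classify_difficulty(sql: str) -> str:
    """Rough difficulty bucket based on token count and SQL keywords (illustrative thresholds)."""
    tokens = sql.split()
    hard_keywords = ("JOIN", "UNION", "INTERSECT", "EXCEPT", "HAVING", "CASE")
    n_hard = sum(sql.upper().count(k) for k in hard_keywords)
    if len(tokens) > 40 or n_hard >= 2:
        return "Challenging"
    if len(tokens) > 20 or n_hard == 1:
        return "Moderate"
    return "Simple"

def classify_query_type(sql: str) -> str:
    """Assign one of the post's categories by checking keywords in an assumed priority order."""
    s = sql.upper()
    if re.search(r"\b(SUM|AVG|COUNT|MIN|MAX)\s*\(", s):
        return "AGG/RATIO"
    if "JOIN" in s:
        return "JOIN"
    if "GROUP BY" in s:
        return "GROUP BY"
    if "ORDER BY" in s:
        return "ORDER BY"
    if "WHERE" in s:
        return "WHERE"
    return "SELECT"
```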

Training data analysis

Evaluation Analysis

On this tab, you will find different analyses based on the provided results.

Evaluation analysis

1. Accuracy
This is a key metric for Text2SQL evaluation: the model_op query is matched against the golden query. Some systems match queries at the token level; in our evaluation pipeline, we calculate accuracy on the basis of the results. We match the exact execution results, which you can see as execution accuracy on the dashboard.
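A minimal sketch of this result-based matching, assuming SQLite databases and an order-insensitive comparison of result sets (both assumptions for illustration, not necessarily the exact pipeline behaviour):

```python
import sqlite3

def results_match(db_path: str, golden_sql: str, generated_sql: str) -> bool:
    """Execute both queries and compare result sets order-insensitively (assumes a SQLite database)."""
    with sqlite3.connect(db_path) as conn:
        gold = conn.execute(golden_sql).fetchall()
        pred = conn.execute(generated_sql).fetchall()
    return sorted(map(tuple, gold)) == sorted(map(tuple, pred))
```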

2. Execution Accuracy
This is another key metric. Here we measure how many generated queries are executable without any error. The results may not match the golden query’s results, or the query may return nothing, but the generated SQL is valid and runs.
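A minimal sketch of this metric, again assuming SQLite; it only counts whether each generated query runs without raising an error, not whether its result is correct:

```python
import sqlite3

def execution_accuracy(db_path: str, generated_queries: list[str]) -> float:
    """Fraction of generated queries that execute without raising an error (result correctness not checked)."""
    ok = 0
    with sqlite3.connect(db_path) as conn:
        for sql in generated_queries:
            try:
                conn.execute(sql).fetchall()
                ok += 1
            except sqlite3.Error:
                pass
    return ok / len(generated_queries) if generated_queries else 0.0
```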

3. Query type Accuracy
We calculate execution accuracy for each query type. This helps us understand which LLM is good at generating which type of query. You can see the count and percentage for each query type.
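A minimal sketch of per-type accuracy with pandas, assuming the evaluation DataFrame already has a `query_type` column (e.g. from a classifier like the one above) and a boolean `executed_ok` column (hypothetical names):

```python
import pandas as pd

def per_type_accuracy(df: pd.DataFrame) -> pd.DataFrame:
    """Count and execution-accuracy percentage per query type."""
    grouped = df.groupby("query_type")["executed_ok"]
    return pd.DataFrame({
        "count": grouped.size(),
        "accuracy_pct": (grouped.mean() * 100).round(2),
    })
```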

Query classification analysis

4. Result Analysis
This metric groups executed queries by their results. It lets us determine how many queries return exactly the same result as the golden query, how many return a partial match, how many return no result at all despite being valid SQL, and how many produced errors.
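A minimal sketch of such bucketing; defining a partial match as a non-empty row overlap is an assumption made here for illustration:

```python
def categorize_result(gold_rows, pred_rows, error: str | None) -> str:
    """Bucket one evaluated query; 'partial match' here means a non-empty row overlap (assumed definition)."""
    if error:
        return "error"
    if not pred_rows:
        return "no result"
    gold_set, pred_set = set(map(tuple, gold_rows)), set(map(tuple, pred_rows))
    if gold_set == pred_set:
        return "exact match"
    if gold_set & pred_set:
        return "partial match"
    return "mismatch"
```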

Result analysis

5. Error Analysis
This analysis generalizes the types of errors we get from the generated queries. It enables us to discern whether the errors stem from syntax issues, query types, or other SQL errors. Based on this analysis, we developed a Query correction module aimed at enhancing the results. You can explore this module in greater detail in our blog.
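A minimal sketch of grouping raw error messages into coarse buckets; the SQLite-style message strings and the bucket names are assumptions:

```python
def classify_error(error_msg: str) -> str:
    """Map a raw database error message to a coarse bucket (SQLite-style messages assumed)."""
    msg = error_msg.lower()
    if "syntax error" in msg:
        return "syntax"
    if "no such table" in msg or "no such column" in msg:
        return "schema mismatch"
    if "ambiguous column" in msg:
        return "ambiguous column"
    return "other"
```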

Error type analysis

Query Classification Dashboard

On this tab, you can select pre-trained and fine-tuned models and see the results across all your experiments as a heat map. This map helps you understand which LLM gives better results for each query type.
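A minimal sketch of building such a heat map with pandas and seaborn, assuming a pre-aggregated DataFrame with hypothetical `experiment`, `query_type`, and `execution_accuracy` columns (in a Streamlit app you would hand the figure to `st.pyplot` instead of calling `plt.show()`):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_model_query_type_heatmap(results_df: pd.DataFrame) -> None:
    """results_df is assumed to hold one row per (experiment, query_type) pair."""
    pivot = results_df.pivot(index="experiment", columns="query_type", values="execution_accuracy")
    sns.heatmap(pivot, annot=True, fmt=".1f", cmap="YlGnBu")
    plt.title("Execution accuracy by experiment and query type")
    plt.tight_layout()
    plt.show()
```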

Query classification dashboard

Comparison Analysis between Models

In this tab, you can contrast the outcomes of two or more experiments involving Large Language Models (LLMs). Here, you can review accuracy, execution accuracy, and partial match results side by side. If you’re comparing a fine-tuned model with a pre-trained model, the table below also displays all the parameters used for the fine-tuned model during the training process.
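A minimal sketch of such a comparison table with pandas, assuming one row per evaluated question and hypothetical boolean columns `exact_match`, `executed_ok`, and `partial_match`:

```python
import pandas as pd

def compare_experiments(eval_df: pd.DataFrame) -> pd.DataFrame:
    """Per-experiment accuracy, execution accuracy, and partial-match rate as percentages."""
    return (
        eval_df.groupby("experiment")[["exact_match", "executed_ok", "partial_match"]]
        .mean()
        .mul(100)
        .round(2)
        .rename(columns={
            "exact_match": "accuracy_%",
            "executed_ok": "execution_accuracy_%",
            "partial_match": "partial_match_%",
        })
    )
```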

Comparison analysis between models

Conclusion

Using the QueryCraft Query analysis dashboard, you can understand more about the Text-To-SQL execution framework, with many analyses in one place. This dashboard helps you make data-driven decisions, refine your experiments, and ultimately deliver more accurate and reliable Text-To-SQL conversion capabilities.

Ready to take your NL2SQL models to the next level? Explore QueryCraft’s pipeline.

Follow Towards Generative AI for more on the latest advancements in AI.
