Inside Snowflake: Building the most powerful SQL LLM in the world

Part 2: Expanding Evaluation to be user-centric...with LLMs

Authors: Daniel Campos, Filip Gralinski, Nicole Limtiaco

Improving the performance of Data Copilots using an Evaluation Copilot

In the age of large language models (LLMs), you can "talk with your data," right? Ask: what's the population of each state ordered from north to south? Just let an LLM convert it to SQL, and voila! Right?

We hate to burst your bubble, but generating SQL from natural language is not a solved problem. Claims of super-high, 90%+ accuracy usually rely on simple academic datasets and unrealistic evaluation metrics. Going from solved academic benchmarks to real databases is a complex task. At Snowflake, we maintain vast and complex datasets, some with over 50 trillion rows [1]. Our AI underlies powerful experiences like Snowflake Copilot. We are building the most powerful SQL LLM for real-world enterprise use cases. This is our experience.

Last week, we burst the bubble on data quality and complexity. We showed why commonly used academic datasets are less complex and diverse than real production databases. This week, we will dive deep into evaluation metrics, showing why the industry-standard "Execution Accuracy" is insufficient. Inspired by work in machine translation of human languages, we introduce a new metric, "Execution Score," which uses a language model to judge SQL code. This approach is more robust to stylistic questions, is better at the fuzzy matching of outputs, does not require access to underlying datasets, and generally complements Execution Accuracy to provide a much fuller and more user-centric framework to evaluate the performance of our SQL language models.

Background: How can open-ended generation systems be qualified?

Let's say you've created a great language model for text-to-SQL. You want some numbers to back up your assumptions before you tell the world your system is best. How do you calculate a single evaluation score for your model?

Typically, you'd first assemble a test dataset representing the task the model will be used for. This dataset consists of natural language queries paired with human-written "gold SQL" representing the "right answer." Then, you run the model on the inputs, which might be natural language questions, commands, or sentences in a given language. Your output is a set of generated SQL statements or translations. But how do you know if your candidates are any good? In tasks like classification, this is a straightforward comparison: check whether the predicted and gold classes agree. However, it can be quite hard in nuanced tasks like machine translation or SQL generation, because the gold is only one of many possible correct translations or SQL statements.

In evaluation, the type of system being tested affects how hard it is to evaluate effectively. Classification-based systems are quite simple and direct to score, while generative tasks such as machine translation or text-to-SQL require comparing a candidate against an enormous set of possible valid generations.

In an ideal world, you would assemble every possible correct statement or translation and check whether the candidate appears in that set. However, your job is not to enumerate infinite sets but to ship great models. Unlike the fuzzy, non-executable world of sentences, SQL statements can be run, so you take the pragmatic approach: execute both the gold and candidate SQL queries and compare the outputs. If the outputs match, it is a success; otherwise, it is a failure. Then, you calculate the accuracy across the whole test set as the success rate. This is called Execution Accuracy (EA).
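
Concretely, an EA harness boils down to "run both queries and compare the result sets." Here is a minimal sketch in Python, using sqlite3 purely for illustration; the helper names and the choice to sort rows before comparing are our own assumptions for the sketch, not part of any Snowflake tooling.

import sqlite3

def run_query(conn: sqlite3.Connection, sql: str):
    """Execute a query and return its rows, or None if it fails to run."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # a query that does not execute can never match
    # Sorting makes the comparison insensitive to row order -- already a
    # judgment call, as the next section discusses. key=repr keeps rows
    # containing NULLs or mixed types sortable.
    return sorted(rows, key=repr)

def execution_accuracy(conn, examples) -> float:
    """examples: iterable of (gold_sql, candidate_sql) pairs."""
    successes = 0
    total = 0
    for gold_sql, candidate_sql in examples:
        gold = run_query(conn, gold_sql)
        candidate = run_query(conn, candidate_sql)
        successes += int(gold is not None and gold == candidate)
        total += 1
    return successes / max(total, 1)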

The Limits of Execution Accuracy

The simplicity of this metric is immediately apparent. Execution Accuracy lets you show that two seemingly different SQL statements are equivalent by demonstrating that they yield the same results when executed. In that sense, Execution Accuracy is a powerful judge of semantic equivalence. EA was even used in Spider, the first text-to-SQL dataset with widespread adoption.

However, the devil is in the details, and with SQL, the details are everything. What happens if the outputs contain the same data, but the columns are ordered or named differently? Should these outputs be considered different? Probably not. What about row order? Sometimes a gold query will include an ORDER BY just to make results more comprehensible! If ordering is optional, how should it be scored?
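
A toy example makes the arbitrariness concrete; the values below are illustrative only.

gold = [("Alaska", 733391), ("Washington", 7705281)]       # ordered north to south
candidate = [("Washington", 7705281), ("Alaska", 733391)]  # same data, different row order

strict = gold == candidate                              # False: row order differs
order_insensitive = sorted(gold) == sorted(candidate)   # True: same rows

# If the gold query added ORDER BY purely for readability, which verdict is
# "correct"? EA forces a single answer to what is really a stylistic question.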

These issues point to a general set of challenges with automatically comparing the output of the SQL generated by your model against the output of the gold SQL. Here are the principal challenges we've seen.

Challenge 1: Real-world queries are ambiguous.

Ambiguity abounds with natural language, while SQL is specific. Let's take a natural language query like this:

Show me the total population of each state ordered from the most northern one to the most southern one.

Should a system consider the tip of the state or its center of mass? How about the capital? If systems disagree on the interpretation of a statement, how should we represent that in a metric?

Challenge 2: The "right" SQL results might be a question of style!

Even if we did know how to compute north and south unambiguously, let's consider a more stylistic question: what should the columns be? Should we punish the model for not generating an extra column, or for generating one that arguably makes sense? Imagine the natural language query above is issued to a text-to-SQL system.

Much like gardeners choosing what plants belong together, SQL experts are very measured and exact about the structure of their queries. Some SQL stylists favor readability, while others favor execution efficiency or query editability. How can we measure that?

For example, some users may prefer a lot of extra columns in the result set, say if they are doing data exploration. A user preparing a report may want just the relevant columns for the point they're trying to show in the report. Suppose a user needs to communicate the results to a broader audience. In that case, they may make nice transformations on numerical values in the columns (e.g., converting from seconds to minutes or hours, rounding, formatting for human consumption). In contrast, those transformations aren't as necessary if they're the only consumers of their results.

Going back to our earlier natural language question, let's assume it is issued to a single-turn text-to-SQL engine. For that request, the gold SQL might look like:

SELECT SUM(POP10), cbsa.state_name
FROM county
JOIN cbsa
ON county.geoid = CONCAT(CBSA.fips_state_code,CBSA.FIPS_COUNTY_CODE)
GROUP BY state_name
ORDER BY max(INTPTLAT) desc;

Its result set contains no latitude or longitude column, but adding one does not feel like a big mistake (unlike adding some random, irrelevant columns!). This query, on the other hand, explicitly shows the user the latitude:

SELECT SUM(POP10), cbsa.state_name, max(INTPTLAT)
FROM county
JOIN cbsa
ON county.geoid = CONCAT(CBSA.fips_state_code,CBSA.FIPS_COUNTY_CODE)
GROUP BY state_name
ORDER BY max(INTPTLAT) desc;

Is this more explicit result set the right answer? Both statements are correct for the user's needs, but their executions will not match, so Execution Accuracy declares them different. (By the way, this is a real example from the US Zip Code Crosswalk dataset on Snowflake Marketplace.)

Challenge 3: Comparing Fuzzy Output

Another difficulty with comparing only outputs is implicit soft matching and unit conversion. If a dataset stores times in milliseconds but users prefer to view times in minutes, the gold query may convert the value to minutes even though the question didn't explicitly ask for this. EA will call the candidate wrong if it doesn't adopt the same implicit conversion.
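
For instance, consider the hypothetical pair below over an imagined sessions(duration_ms) table; the table and column names are made up for illustration.

gold_sql = "SELECT ROUND(AVG(duration_ms) / 60000.0, 2) FROM sessions"  # minutes
candidate_sql = "SELECT AVG(duration_ms) FROM sessions"                 # milliseconds

# Both answer "how long is an average session?", but the executed outputs
# differ by a constant factor of 60,000, so EA marks the candidate as wrong.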

Challenge 4: Execution Accuracy requires access to the underlying data

A fundamental assumption of EA, and of all execution-based approaches, is direct access to the underlying data and a trusted gold query. While we have spent a lot of energy building such datasets, we often cannot access the underlying data when evaluating queries, and curating trusted gold queries for even a fraction of our users' questions would be a continuous challenge. This raises the question: how can you evaluate your product when the existing approach is structurally incompatible with real-world usage?

A new LLM-based approach: Execution Score

As we described above, Execution Accuracy can be a helpful metric when designed carefully and with access to the right environment — one in which you have a trusted set of golden queries and access to the underlying databases on which to execute those queries. In practice, we are rarely in said environment. Running queries on real customer databases is out of the question. We know our customers handle sensitive data, and we take their privacy seriously.

We looked to other domains for inspiration. In machine translation, old-school word-overlap metrics like BLEU and ROUGE have been displaced by metrics trained on human judgments, such as COMET, and by using language models to score translations directly. In information retrieval and search, the field has long used human evaluation to qualify system performance and decide which improvements are ready for release. While human annotation can prove effective, it has drawbacks: evaluating experiments and generating metrics runs at the speed of people, which means days or weeks to get labeled samples. To get around this, recent efforts have shifted data labeling from human annotators to language models with a high degree of success (Large language models can accurately predict searcher preferences). In other words, have a model check the work of another model to provide a proxy for model quality.

How Execution Score Works

Our approach to cross-check model quality using LLMs was relatively simple: generate a prompt that evaluates a single axis of model quality.

After careful and extensive iteration, we found a recipe that worked quite well for generating prompts, which is as follows:

  1. Describe what is expected of the model. The same prompt should make sense if you swapped the LLM for a human evaluator.
  2. Describe the rating scale and what each score means.
  3. Describe what information the model can access, and include the per-sample information.
  4. Describe how the model should go about the evaluation, step by step.
  5. Describe the expected output and the expected order of generation. While it may seem minor, we found performance significantly improved when the model is required to generate an explanation, and even more so when its score must be based on that explanation.

One such example prompt, shown below, focuses on how well a user's needs would be met by a generated SQL statement. We found that requiring an explanation leads to improved debuggability and performance. Moreover, requiring the explanation to be generated first further improves the ratings, as the score is conditioned on the explanation rather than the converse.

You are a data analyst quality rater responsible for evaluating the 
quality of Snowflake SQL statements generated from natural language
queries by comparing the candidate SQL statement to the user intent.

You must provide a score on an integer scale of 0 to 3 with the following
meanings:
- 3 = perfect match - The candidate SQL will produce the same result
as the user intent.
- 2 = good match - The candidate SQL will produce nearly the same result
as the user intent but may suffer from non-deterministic issues such as
sorting or grouping.
- 1 = partial match - The candidate SQL will produce an output similar
to the user intent but may miss some part of the user's desired output.
- 0 = no match - The candidate SQL will not produce anything similar to
the user intent.

You will have access to the following elements:
1. User Query: The user natural language query enclosed in [{query}].
2. Database Schema: Information about the database schema is enclosed
in [{db_formatted}].
3. Candidate SQL: The SQL query generated by the system is enclosed
in [{candidate_sql}].

Instructions for rating this generation:
1. Understand the user's natural language query and their overall intent.
2. Examine the database schema to determine how the database can be
leveraged to fulfill the user request.
3. Analyze the candidate SQL query.
4. Evaluate how closely the candidate SQL output would match the user
intent. Remember that the results will not be identical if the candidate
refers to columns or table names that do not exist or differ from the
user intent. Additionally, consider that in SQL, the values and exactness
of literals are critical, as minor variations in spelling or casing can
greatly impact results.
5. Assign a score from 0 to 3 based on how well the candidate SQL would
generate equivalent results to the user intent.
Provide explanations for your scores.
Please provide your assessment in a JSON dictionary with the following keys:
- E (Explanation): A detailed description of the observed equivalence between
the candidate and the user intent.
- S (Score): The numerical score you assign, where 3 indicates a perfect match,
and 0 indicates no match.
Do not include additional information in your response;
write the explanation first.
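
Applying the prompt is mechanical. The sketch below is a hypothetical wrapper rather than our production code: call_llm stands in for whatever completion API you use, and the JSON parsing assumes the model followed the output format requested above.

import json

def execution_score(prompt_template: str, query: str, db_formatted: str,
                    candidate_sql: str, call_llm) -> tuple[int, str]:
    """Fill the rater prompt, call an LLM, and parse its JSON verdict."""
    prompt = prompt_template.format(
        query=query,
        db_formatted=db_formatted,
        candidate_sql=candidate_sql,
    )
    raw = call_llm(prompt, temperature=0)   # deterministic settings, as in our runs
    verdict = json.loads(raw)               # expects {"E": "...", "S": 0-3}
    explanation, score = verdict["E"], int(verdict["S"])
    if not 0 <= score <= 3:
        raise ValueError(f"score out of range: {score}")
    return score, explanation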

Comparing Execution Score with Execution Accuracy

We validated our LLM-based metric by measuring its agreement with EA on an internal Text2SQL dataset. While agreement was high, we sought to understand the disagreement further. To achieve this, we read through all the model outputs and judged to what extent we agreed with the EA and LLM judgments.
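
One simple way to quantify that agreement is to binarize the 0-3 Execution Score and compare it with EA's pass/fail label. The sketch below is illustrative; the threshold of 2 is our own assumption, not a prescribed cutoff.

def metric_agreement(ea_labels: list[bool], es_scores: list[int],
                     threshold: int = 2) -> float:
    """Fraction of examples where EA and a thresholded Execution Score agree."""
    assert len(ea_labels) == len(es_scores)
    agree = sum(ea == (es >= threshold) for ea, es in zip(ea_labels, es_scores))
    return agree / len(ea_labels)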

Most metric disagreement is tied to the LLM's impressive ability to fuzzy match across broad and diverse domains. Execution Score suffers when the LLM ignores minor variations in literals that significantly impact the outputs of SQL statements. The first example below shows two statements that differ only in the literal 'TX' vs. 'Texas.' While the language model views these as nearly identical, the difference can significantly change the output of the executed query.

Conversely, this fuzzy matching is also a strength, because it lets the LLM equate generated SQL statements that differ in ways that don't affect user outcomes. The second example below shows two statements that differ only in the numerical representation of the average time spent in the application; both meet the user's need but present the average app duration differently.

Examples of Metric Disagreement Between Execution Accuracy and Execution Score.

Comparing the Execution Score with itself for consistency

After qualifying the differences in evaluation driven by the language model, we sought to understand the instability introduced by using a language model at all. To do so, we ran evaluations with the same prompts and inputs five times over 1,000 data points from one of our evaluation datasets. This experiment gives insight into the underlying variability of Execution Score and of any evaluation method that leverages an LLM.

We use these five runs to measure how often the underlying language-model-based judgment varies. As shown in the table below, the runs did not fully agree on nearly 25% of examples, despite a temperature of 0 and no variation in the prompt. When we expand this into confidence intervals, the metric's interval is about 0.7%: pretty sensitive, but not without faults. Moreover, comparing these labels using Krippendorff's alpha, we find the agreement is an incredible 89%! These experiments lead us to believe that LLM-based evaluation is most appropriate for early experimentation on a dataset, where the variation between systems is significant. Human-labeled methods may become more relevant as the gaps between systems shrink toward the confidence intervals.

The breakdown of label agreement under underlying language model instability. Nearly 80% of the time, the agreement is perfect.
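
A sketch of how such a consistency check can be computed is below. The exact-agreement and modal-share statistics are our own summary choices, and the krippendorff package mentioned in the comment is just one option for the chance-corrected alpha.

from collections import Counter

def self_consistency(runs: list[list[int]]) -> tuple[float, float]:
    """runs: one list of scores per run, all aligned to the same examples."""
    per_example = list(zip(*runs))          # scores for each example across runs
    n, k = len(per_example), len(runs)
    unanimous = sum(len(set(scores)) == 1 for scores in per_example)
    modal = sum(Counter(scores).most_common(1)[0][1] for scores in per_example)
    # For chance-corrected agreement, the `krippendorff` package can compute
    # alpha, e.g. krippendorff.alpha(reliability_data=runs,
    # level_of_measurement="ordinal") -- treat that as a pointer, not a spec.
    return unanimous / n, modal / (n * k)   # perfect-agreement rate, modal-label share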

One of our high-level findings from this exploration is that existing text-to-SQL evaluation metrics have been created by and for SQL geeks and academics, who favor rigorous, execution-sensitive metrics. At Snowflake, we have a user focus; thus, our metrics focus on how well our tools meet user needs. We believe Execution Score offers a user-centric method that we can use to improve our models without compromising our commitment to user privacy. We are exploring how to integrate this metric into training to provide a more active signal of model quality. Moreover, this gold-free, user-centric evaluation will help us build the best model, because we have the most data and can understand how well our model meets user needs.

References

[1] Based on public Snowflake data as of April 30, 2023, we handle 3.9 billion daily queries on customer datasets that sometimes exceed 50 trillion rows!

Daniel Campos is a Research Scientist at Snowflake, where he works on LLMs for retrieval and data understanding.