Inside Snowflake: Building the most powerful SQL LLM in the world

Part 1: All evaluation data sets are wrong, some are useful.

At Snowflake, our goal is to build the world’s best AI for enterprise data. We are developing language models that not only power new conversational experiences in Snowflake, but also let our customers apply generative AI to the vast and complex datasets maintained in Snowflake, sometimes exceeding 50 trillion rows! [1] This blog post marks the beginning of a four-part series guiding you through our journey in creating Snowflake Copilot, the world’s leading text-to-SQL large language model (LLM). Our initial focus is a crucial aspect of this venture: building robust datasets to evaluate the LLMs we’re building.

Wait. Isn’t this already solved?

Text-to-SQL seems straightforward, right? You just need a list of (NL, SQL) pairs [2]. For example, here are some such pairs from Spider, an early open-source dataset from Yale that has been a standard text-to-SQL benchmark for many years.

Example of Spider Dataset. Source: https://yale-lily.github.io/spider
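
To make the format concrete, here is a minimal sketch of what such pairs look like. The questions, queries, and the "singer" schema below are illustrative, in the spirit of Spider, rather than copied from it.

```python
# Illustrative (NL, SQL) pairs; the "singer" schema here is hypothetical.
examples = [
    {
        "nl": "How many singers do we have?",
        "sql": "SELECT COUNT(*) FROM singer",
    },
    {
        "nl": "What are the names of singers older than 30, ordered by age?",
        "sql": "SELECT name FROM singer WHERE age > 30 ORDER BY age",
    },
]
```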

This looks great! Can’t you then simply train your model and call it a day? In fact, a few months ago, we did just that, fine-tuning Meta’s CodeLlama and nearly matching GPT-4 on this dataset. We’re done, right?

Problem 1: Widely-used SQL Evaluation Test Sets are Unrealistically Simple

… well, no. Spider’s databases are quite unrealistic. Its clean, crisp, and simple databases are a far cry from the complex real-world schemas we see in production use cases. We need to evaluate on large, messy DBs with lots of tables and columns, not necessarily named consistently, and Spider is unfortunately nothing like that.

Recognizing these limitations, some researchers released Spider remixes, such as Spider Realistic or Spider-Syn, but the improvements were limited to the development set. The more recent open-source dataset, BIRD-SQL, was an attempt to start from scratch, and it is, to some extent, a better, harder, more realistic dataset. However, it still falls far short of the complexity of a real enterprise dataset.

Look at how much richer and more complex Snowflake Marketplace datasets are. For example, one of the most popular databases in the Snowflake Marketplace, COVID-19 Epidemiological Data, has 43 tables.

Problem 2: Snowflake SQL has unique capabilities

Another issue is that Spider and BIRD-SQL are built specifically on the SQLite dialect of SQL, whereas we needed a dataset tailor-made for Snowflake SQL. Some unique properties of Snowflake SQL relative to SQLite, illustrated in the sketch after this list, are:

  • Snowflake SQL has richer functions, like those for formatting timestamps. By comparison, strftime in SQLite is much more difficult to use.
  • Snowflake SQL handles NULLs more intelligently. For instance, when averaging with the avg function, SQLite converts NULLs to zeros, while Snowflake SQL considers only non-NULL values. (Be careful when calculating the mean age of your team in SQLite if some of the values are unknown!)
  • Snowflake SQL has special syntax for semi-structured data, for example for manipulating JSON.
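
Here is a minimal sketch of the kinds of queries we mean. The table and column names are hypothetical; the point is the Snowflake SQL syntax itself (TO_VARCHAR with a format string, NULL-ignoring AVG, and the colon/cast notation for VARIANT columns).

```python
# Hypothetical tables and columns; only the Snowflake SQL syntax is the point.

# Rich timestamp formatting (compare with strftime in SQLite):
format_ts = "SELECT TO_VARCHAR(order_ts, 'YYYY-MM-DD HH24:MI') AS order_time FROM orders"

# AVG considers only non-NULL values, so unknown ages do not skew the mean:
mean_age = "SELECT AVG(age) AS mean_age FROM team_members"

# Semi-structured data: traverse a JSON (VARIANT) column and cast the result:
json_path = "SELECT payload:customer.address.city::STRING AS city FROM raw_events"
```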

These features are what make Snowflake SQL such a rich language for data understanding and transformation, and we want to capture all these rich possibilities in our large language models!

Problem 3: We’ve learned all we can from past datasets

The existence of widely used datasets has propelled incredible progress in machine learning, and Spider is no exception. The problem is that we have learned almost all we can from this dataset. The best systems reported in the literature are approaching 80% exact match accuracy [3] and over 90% execution accuracy with values. At this point, improving over this ceiling mostly means overfitting to Spider’s quirks and erroneous samples.

https://paperswithcode.com/sota/text-to-sql-on-spider

Towards state-of-the-art evaluation

The implications are clear: we need to create our own SQL dataset. Luckily, Snowflake has some unique advantages here. We have our own SQL engine, and our ecosystem includes the Snowflake Marketplace and Streamlit, an amazing tool for building ML apps such as modeling dashboards and annotation UIs. Let’s dive into how we built our own dataset!

Challenge 1: We want to reflect real-world use cases, but can’t train on customer data

As mentioned above, real-world schemas are very complex, and we’d like to reflect that complexity in our dataset. However, we’re not allowed to train models on customer data. In fact, as members of the product team, it’s impossible for us to even see customer data! To mimic the complexity of the real world, then, we leveraged a powerful part of the Snowflake ecosystem: the Snowflake Marketplace, where hundreds of providers share their datasets with Snowflake users. At the end of the day, these are real-life Snowflake DBs just like any other! And the Marketplace is highly diverse, both in the domains and businesses it covers and in the complexity and hardness of its DB schemas. After vetting, we selected a subset of 24 Marketplace DBs.

Challenge 2: Reflect a semantic understanding of the data, at various levels of complexity

Then, for each database, a SQL-savvy data analyst (or someone who already knew the DB well) first took the time to deeply understand its schema and all its intricacies. It is critical to understand both the database schema and the table data before attempting to write any NL/SQL examples against it: your SQL is unlikely to correctly answer the NL question if you do not understand the nuances of the schema and the data, e.g., whether a table has a primary key, whether joining two tables on a key results in a many-to-many join, or whether a column has duplicate values.
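
For illustration, here are two such sanity checks, with hypothetical table and key names: one tests whether a supposed key is actually unique, the other whether a join fans out.

```python
# Hypothetical schema; quick checks an annotator might run before writing NL/SQL examples.

# Is order_id really a unique key of ORDERS? Any rows returned mean it is not.
duplicate_key_check = """
    SELECT order_id, COUNT(*) AS n
    FROM orders
    GROUP BY order_id
    HAVING COUNT(*) > 1
"""

# Does joining ORDERS to SHIPMENTS on order_id fan out (more joined rows than orders)?
fanout_check = """
    SELECT (SELECT COUNT(*) FROM orders)                    AS order_rows,
           (SELECT COUNT(*) FROM orders o
              JOIN shipments s ON o.order_id = s.order_id)  AS joined_rows
"""
```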

Once an annotator had a good feel for a database, they were asked to come up with a number of examples of varying difficulty. Of course you want hard test samples, but do not forget to provide examples of low and medium hardness: it is surprisingly easy to end up with a model that tends to generate overly long or convoluted SQL! For some examples we also used LLMs for inspiration, to get plausible NL questions. All examples were then double-checked, starting with automated checks that the SQL actually executes and returns some rows (an empty result is usually suspicious) and ending with manual review.
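
Here is a sketch of the automated check described above, assuming a Snowflake connection is available; the connection parameters and the shape of the test examples are placeholders.

```python
import snowflake.connector

def check_example(conn, sql: str) -> str:
    """Classify a candidate gold SQL as 'ok', 'empty' (suspicious), or 'error'."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return "ok" if cur.fetchmany(5) else "empty"    # no rows is usually suspicious
    except snowflake.connector.errors.ProgrammingError:
        return "error"                                   # the SQL does not even execute
    finally:
        cur.close()

# Placeholder credentials; in practice they would come from a config or secrets store.
conn = snowflake.connector.connect(account="...", user="...", password="...")
for example in test_examples:  # hypothetical list of {"nl": ..., "sql": ...} dicts
    status = check_example(conn, example["sql"])
    if status != "ok":
        print(status, "-", example["nl"])
```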

Challenge 3: Creating workflows and tools to let humans annotate data and benchmark results

Creating a single good NL item and its SQL equivalent is one thing; making the test set make sense as a whole is another. We need the right distribution and balance of sample hardness, as already mentioned, but also good coverage of expression types: do we have the right number of joins? Subqueries? GROUP BYs? Aggregations? Specific functions? More advanced, perhaps Snowflake-specific, operators?

Fortunately, with Streamlit in Snowflake (currently in public preview) you can easily create dashboards for exploratory data analysis of test sets, to keep an eye on what is good and what is bad in them. Furthermore, we keep our test sets, along with their versions and variations, in a Snowflake DB, so that a number of people can collaborate on preparing and validating a test set. Treating an evaluation dataset as data living in a database, just like any other data, simply makes your life easier.
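
As a sketch of what such a dashboard can look like (the test-set source and the feature patterns below are assumptions), a few lines of Streamlit are enough to track coverage of expression types across the test set:

```python
import pandas as pd
import streamlit as st

# Assumed layout: one row per test example, with an NL question and a gold SQL query.
df = pd.read_csv("test_set.csv")  # or read it straight from the Snowflake table holding the test set

patterns = {
    "join": r"\bJOIN\b",
    "subquery": r"\(\s*SELECT\b",
    "group_by": r"\bGROUP\s+BY\b",
    "aggregation": r"\b(?:COUNT|SUM|AVG|MIN|MAX)\s*\(",
}
coverage = {
    name: int(df["gold_sql"].str.contains(pat, case=False, regex=True).sum())
    for name, pat in patterns.items()
}

st.title("Test-set coverage")
st.bar_chart(pd.Series(coverage))  # how often each expression type occurs
st.dataframe(df.sample(10))        # spot-check a few random examples
```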

Looking for a better dataset… with a Streamlit app created on the go

Challenge 4: Testing the tester. How can we evaluate the dataset itself?

An evaluation test set is a touchstone for a text-to-SQL model, but it can also work in the opposite direction: a model (even a baseline or vanilla model) can be a touchstone for an evaluation test set. What kind of score do you get? 0% is not good, but 99% does not mean your test set is good either. You can go deeper and look at broadly understood features (does the gold SQL contain a join? Does the predicted SQL use the avg function? Does the NL contain the word “and”?) that correlate with your evaluation score dropping suspiciously. Sometimes it is just the model and its deficiencies, or a trivial error in pre- or post-processing. But sometimes the problem is actually in your test set.

As part of our Streamlit dashboard we print a list of “trouble-making” features of our test samples. Technically, we calculate p-values using the Mann-Whitney U-test and sort the features starting with the smallest p-values. This way you can find features that are infrequent but have a huge impact on your scores, as well as frequent features that almost always have some impact. Both cases are interesting! It is a neat trick, surprisingly underappreciated even by seasoned ML experts. For instance, we discovered that items coming from one of the databases were all failing on a number of baseline models. It turned out that the models were mishandling mixed-case table and column names, which must be put in double quotes in Snowflake SQL. We had not been aware of this problem!
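
Here is a minimal sketch of this computation, assuming a hypothetical per-example table with boolean feature columns and an evaluation score per example: for each feature, compare the scores of examples with and without it, and sort the features by p-value.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def troublemakers(df: pd.DataFrame, feature_cols: list[str], score_col: str = "score") -> pd.DataFrame:
    """Rank boolean features by how strongly their presence shifts the evaluation score."""
    rows = []
    for feat in feature_cols:
        scores_with = df.loc[df[feat], score_col]
        scores_without = df.loc[~df[feat], score_col]
        if scores_with.empty or scores_without.empty:
            continue  # feature is constant across the test set, nothing to compare
        _, p_value = mannwhitneyu(scores_with, scores_without, alternative="two-sided")
        rows.append({"feature": feat, "p_value": p_value, "n_with": len(scores_with)})
    # Smallest p-values first, i.e. the biggest "troublemakers" at the top.
    return pd.DataFrame(rows).sort_values("p_value")
```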

Looking for “troublemakers” in the data set, again with a Streamlit app

Looking forward

To paraphrase military theorists, no test set survives first contact with end users. In a way, it is a never-ending process of honing your evaluation dataset to better match the real needs of end users. We are eager to get insights from the first Snowflake Copilot private preview (PrPr) users so we can build even better test sets. That is why we set up telemetry to collect all sorts of feedback signals from end users, starting with thumbs up/down and ending with more sophisticated analyses of what our model generated versus what the user actually did. And again, all these signals, after proper redaction procedures, are stored in a Snowflake DB, where we can analyze them and improve the evaluation dataset accordingly. Unlike in academic or open-source ML challenges, a test set in a real AI product cannot be static; it is a living thing, changing with new users and capabilities. Our test set follows our users in their daily discoveries of how they can use their own data, sometimes getting from Snowflake what they always wanted but, before Snowflake Copilot, “were afraid to ask”.

References

[1] Based on internal Snowflake data as of April 30, 2023, we handle 3.3 billion queries a day on customer datasets that sometimes exceed 50 trillion rows!

[2] “NL” means natural language, like English.

[3] https://paperswithcode.com/sota/text-to-sql-on-spider
