Inside Snowflake: Building the most powerful SQL LLM in the world

Part 3: Retrieve, Prompt, Generate

Snowflake Copilot helps you navigate your organization’s data.

Introduction

At Snowflake, we’re building a powerful SQL LLM for real-world enterprise use cases. This LLM supports Snowflake Copilot (currently in private preview), an LLM-powered assistant to generate and refine SQL with natural language. To build Copilot, we pair our SQL LLM with a powerful search system over database metadata, and insert the results into the prompt. In this article, we go over one of the specific decisions we had to make when building our system: how many tables and columns should we inject into the prompt?

Background

LLMs perform really well at generating useful SQL from text on benchmark datasets like Spider, but as we’ve discussed before, this performance doesn’t easily translate to the real world. There are many reasons for this, but perhaps the most salient is scale: real-world databases are orders of magnitude bigger than benchmark datasets. Snowflake customers routinely have tens of thousands of tables, which frequently have dozens or even hundreds of columns.

For an LLM-powered assistant like Snowflake Copilot to be able to generate SQL in response to a user prompt, it needs to know what tables are in the database, and what information those tables contain. LLMs have limited context windows; for real-world databases, there is no way to inject the entire schema into the prompt. Therefore, we need to figure out which tables, and which columns in those tables, might be relevant to answering the user’s question.

The Core Workflow: Retrieve-Prompt-Generate

How do we figure out which tables might be relevant to answering the user’s question? This is where our Universal Search (currently in private preview) system comes in. We index all the tables accessible to a user using our in-house indexing system (brought to Snowflake via the Neeva acquisition).

Building this search system is non-trivial, and requires some clever modeling and engineering. For most user queries, keyword-based retrieval is not good enough, and vector-based semantic retrieval is needed — the search system needs to “understand” the user query and the database metadata, so it can figure out which tables are likely to be useful. However, off-the-shelf embedding algorithms don’t work very well for searching over tables, since they look very different from traditional “documents”.
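As a rough illustration of the vector-based approach, here is a minimal sketch of a generic embedding-based table retriever. The table metadata and the off-the-shelf model are illustrative assumptions; as noted above, this kind of off-the-shelf embedding is exactly what does not work well enough in practice, and it is not how Universal Search is implemented.

# Minimal sketch: embedding-based retrieval over table metadata.
# Hypothetical example, not the Universal Search implementation.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # off-the-shelf embedding model

# Serialize each table's metadata into a pseudo-document.
tables = {
    "QUERY_HISTORY": ["QUERY_ID", "USER_NAME", "START_TIME", "TOTAL_ELAPSED_TIME"],
    "USERS": ["NAME", "LOGIN_NAME", "EMAIL", "CREATED_ON"],
}
docs = [f"Table {name}. Columns: {', '.join(cols)}" for name, cols in tables.items()]
doc_vecs = model.encode(docs, normalize_embeddings=True)

def retrieve_tables(question: str, k: int = 10):
    """Return the k tables whose metadata is most similar to the question."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity, since vectors are normalized
    ranked = sorted(zip(tables, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]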

We’ll share more about how we built this search system in future posts, but for this post, let’s assume we have a reasonably good retriever. The next step is to construct a prompt to our SQL generation LLM.

As a concrete example, let’s imagine our user is querying the Snowflake database. This is a system-defined database that contains metadata and historical usage data about the objects in an organization’s account. This is a powerful tool for Snowflake administrators to understand the use of Snowflake within their organization.

When our user enters a message into Snowflake Copilot saying “I want to know the top 10 users who have executed queries last month”, the workflow is as follows:

  • We pass this user message to the retriever, which returns a ranked list of tables (and ranked columns within each table).
  • We then construct a prompt like the one below, providing the database schema to the LLM and asking it to generate SQL in response.
Here is a database schema:
| QUERY_HISTORY :
QUERY_ID [TEXT], QUERY_TEXT [TEXT], DATABASE_NAME [TEXT],
SCHEMA_NAME [TEXT], ROLE_NAME [TEXT], USER_NAME [TEXT],
QUERY_TYPE [TEXT], SESSION_ID [NUMBER], EXECUTION_STATUS [TEXT],
ERROR_CODE [NUMBER], ERROR_MESSAGE [TEXT], START_TIME [TIMESTAMP_LTZ],
END_TIME [TIMESTAMP_LTZ], TOTAL_ELAPSED_TIME [NUMBER], BYTES_SCANNED [NUMBER],
BYTES_WRITTEN [NUMBER], ROWS_PRODUCED [NUMBER], COMPILATION_TIME [NUMBER],
EXECUTION_TIME [NUMBER], QUEUED_PROVISIONING_TIME [NUMBER],
QUEUED_REPAIR_TIME [NUMBER], QUEUED_OVERLOAD_TIME [NUMBER],
TRANSACTIONS [ARRAY], WAREHOUSE_NAME [TEXT], WAREHOUSE_SIZE [TEXT],
CLUSTER_NUMBER [NUMBER], QUERY_TAG [TEXT]
| USERS :
USER_ID [NUMBER], NAME [VARCHAR], CREATED_ON [TIMESTAMP_LTZ],
DELETED_ON [TIMESTAMP_LTZ], LOGIN_NAME [VARCHAR], DISPLAY_NAME [VARCHAR],
FIRST_NAME [VARCHAR], LAST_NAME [VARCHAR], EMAIL [VARCHAR],
MUST_CHANGE_PASSWORD [BOOLEAN], HAS_PASSWORD [BOOLEAN], COMMENT [VARCHAR],
DISABLED [VARIANT], SNOWFLAKE_LOCK [VARIANT], DEFAULT_WAREHOUSE [VARCHAR],
DEFAULT_NAMESPACE [VARCHAR], DEFAULT_ROLE [VARCHAR], EXT_AUTHN_DUO [VARIANT],
EXT_AUTHN_UID [VARCHAR], BYPASS_MFA_UNTIL [TIMESTAMP_LTZ],
LAST_SUCCESS_LOGIN [TIMESTAMP_LTZ], EXPIRES_AT [TIMESTAMP_LTZ],
LOCKED_UNTIL_TIME [TIMESTAMP_LTZ], HAS_RSA_PUBLIC_KEY [BOOLEAN],
PASSWORD_LAST_SET_TIME [TIMESTAMP_LTZ], OWNER [VARCHAR],
DEFAULT_SECONDARY_ROLE [VARCHAR]

Please write a single Snowflake SQL statement for this database
that answers the following question:
"I want to know the top 10 users who have executed queries last month"

This step is an instance of Retrieval-Augmented Generation, commonly known as “RAG.” RAG is often associated with search over large text corpora for question answering; here, we search over a large set of database objects and use RAG to focus SQL code generation on a small, relevant subset of tables and columns. This retrieval step allows us to run Snowflake Copilot in real-world environments with tens of thousands of tables!
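Putting these pieces together, the end-to-end flow can be sketched in a few lines. The function names (retrieve_tables, llm_complete) and the exact prompt template are illustrative stand-ins for the retriever and the SQL generation LLM, not Copilot’s actual interfaces.

# Sketch of the retrieve-prompt-generate workflow.
# retrieve_tables() and llm_complete() are hypothetical stand-ins for the
# retriever and the SQL generation LLM.
from typing import Callable

Schema = dict[str, list[tuple[str, str]]]  # table -> [(column, type), ...]

def build_prompt(question: str, schema: Schema) -> str:
    """Render the retrieved tables and columns in the schema format shown above."""
    lines = ["Here is a database schema:"]
    for table, columns in schema.items():
        cols = ", ".join(f"{name} [{dtype}]" for name, dtype in columns)
        lines.append(f"| {table} :\n{cols}")
    lines.append(
        "\nPlease write a single Snowflake SQL statement for this database\n"
        "that answers the following question:\n"
        f'"{question}"'
    )
    return "\n".join(lines)

def generate_sql(question: str,
                 retrieve_tables: Callable[[str], Schema],
                 llm_complete: Callable[[str], str]) -> str:
    schema = retrieve_tables(question)       # 1. retrieve relevant tables/columns
    prompt = build_prompt(question, schema)  # 2. construct the prompt
    return llm_complete(prompt)              # 3. generate SQL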

The Key Question: How Many Tables and Columns To Include

Considerations

At this juncture, we faced a key question: which tables (and columns for each table) do we include in the prompt?

At one extreme, we could stuff as many tables and columns as would fit in the model’s context window. The reasoning is simple: the SQL generation LLM is a powerful model that can figure out which tables and columns to use for its query better than the retriever can, since the retriever has much more limited reasoning ability and is optimized for low-latency fuzzy search over a large number of documents.

There are, however, also arguments for including as few tables and columns as possible (a sketch of what this trimming looks like follows the list):

  • The retriever has access to a number of signals that it uses for ranking; for example, we have statistics on which tables are queried most frequently, and which columns within those tables frequently appear in WHERE, GROUP BY, and JOIN clauses. Although the retriever is a smaller model, these signals can help it filter out tables and columns that would confuse the larger model.
  • Empirically, we found (unsurprisingly) that the final accuracy of our LLM is best when it is given exactly the tables and columns that it needs. Adding even a few unrelated tables and columns can sometimes cause it to use them incorrectly. (Obviously, though, it’s impossible to know the exact set of golden tables and columns when running in production.)
  • We want to save space in the prompt for other potentially useful information we might want to pass to the LLM.
  • Finally, context length optimization also matters because a longer context is more computationally expensive, so all else being equal, a smaller prompt is better.
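Concretely, the knob we are turning is a simple truncation of the retriever’s ranked output before prompt construction, something like the sketch below. The max_tables and max_columns parameters and data structures are hypothetical; how to set their values is exactly the question the experiments in the next section explore.

# Sketch: trimming the retriever's ranked output before building the prompt.
# max_tables and max_columns are the knobs discussed in this post; the
# data structures here are illustrative.
def trim_schema(ranked_tables, max_tables=10, max_columns=10):
    """Keep the top-ranked tables and, within each, the top-ranked columns.

    ranked_tables: list of (table_name, ranked_columns) pairs, best first,
    where ranked_columns is a list of (column_name, column_type) pairs.
    """
    return [
        (table_name, ranked_columns[:max_columns])
        for table_name, ranked_columns in ranked_tables[:max_tables]
    ]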

As an illustration of an extra column confusing the LLM, here is what we get when the prompt above is passed to the model. The model makes a subtle error: it joins on u.LOGIN_NAME rather than u.NAME.

SELECT
    u.NAME,
    COUNT(q.QUERY_ID) AS query_count
FROM
    QUERY_HISTORY q
JOIN
    USERS u ON q.USER_NAME = u.LOGIN_NAME
WHERE
    q.START_TIME >= DATEADD(MONTH, -1, CURRENT_TIMESTAMP())
    AND q.START_TIME < CURRENT_TIMESTAMP()
GROUP BY
    u.NAME
ORDER BY
    query_count DESC
LIMIT
    10;

What We Found

As with most work involving LLMs, the answers are very empirical. We heavily leveraged our evaluation set to answer these questions. Here’s what we found:

  • Being able to rank columns is extremely important, at least for the sorts of tables we typically see in Snowflake. It is not sufficient to simply inject all columns of the top-k tables into the prompt.
  • In general, with our real-world retriever, we found that the more tables, and the more columns per table, we included in the prompt, the better the final SQL generation performance.
  • There were diminishing returns once we went above 10 tables and 10 columns per table.
Evaluation of Snowflake SQL code generation on our proprietary test sets, designed to model the true complexity of real-world enterprise database schemas. “Execution Accuracy” measures result-set overlap relative to human-written SQL. “Eval Score” uses an LLM-as-judge approach to compare SQL queries.
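For reference, the “Execution Accuracy” metric above compares the result sets produced by the generated and human-written queries. A minimal version of that comparison might look like the following sketch; the exact overlap definition used in our evaluations may differ.

# Sketch: result-set overlap between generated and reference SQL.
# The exact metric used in our evaluations may differ from this version.
def execution_overlap(generated_rows, reference_rows) -> float:
    """Fraction of reference result rows that also appear in the generated result."""
    gen = {tuple(row) for row in generated_rows}
    ref = {tuple(row) for row in reference_rows}
    if not ref:
        return 1.0 if not gen else 0.0
    return len(gen & ref) / len(ref)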

Note that these results depend heavily on the specific performance characteristics of the retriever and the generative LLM. We will continue to lean heavily on our eval set as both the LLM and the retriever improve (while also continuously improving the evaluation set itself, to make it as representative as possible) to make sure we set these parameters optimally.

Conclusion and Future Work

To build an industry-leading SQL assistant, the key is to give our LLM the right context so that it can be genuinely useful.

Snowflake Copilot is still in the early stages of development. We’re experimenting with a bunch more things, ranging from simple tweaks to more significant changes. Some examples are:

  • Dynamically choosing the number of tables and columns using raw search scores (see the sketch after this list).
  • Explicitly inserting metadata into the prompt so the LLM knows which tables and columns frequently appear in JOIN, WHERE, and GROUP BY clauses.
  • Letting the LLM use its reasoning capabilities to control retrieval iteratively.
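As an example of the first item, a score-based dynamic cutoff might look something like the sketch below. The relative threshold and score semantics are purely illustrative, not tuned parameters.

# Sketch: score-based dynamic cutoff instead of a fixed top-k.
# The 0.5 relative threshold is purely illustrative.
def dynamic_cutoff(ranked, min_relative_score=0.5, max_items=10):
    """Keep (name, score) items until the score drops below a fraction of the best score."""
    if not ranked:
        return []
    best_score = ranked[0][1]
    return [
        (name, score)
        for name, score in ranked[:max_items]
        if score >= min_relative_score * best_score
    ]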

You can learn more about Snowflake Copilot, part of Snowflake Cortex, here.

Acknowledgments

This was joint work with Ashwin Devaraj, Macy Ngan and Kelvin So.
