Snowflake Cortex Analyst: Trust But Verify
As you have probably heard, Snowflake Cortex Analyst is currently available in Preview in most Snowflake regions. This is in many ways a breakthrough feature: it promises to do for structured data (i.e. dimensionally modelled data) what Retrieval-Augmented Generation (RAG) has done for unstructured data like PDFs, enabling the querying and exploration of data using natural language.
The value of Cortex Analyst lies in its fully-managed nature. Snowflake puts it this way:
Building a production-grade conversational self-service analytics solution requires a service that generates accurate text-to-SQL responses. For most teams, developing such a service that successfully balances accuracy, latency, and costs is a daunting task. Cortex Analyst simplifies this process by providing a fully managed, sophisticated agentic AI system that handles all of these complexities, generating highly accurate text-to-SQL responses.
This sounds very attractive. By now most of us have used LLMs to generate code, but meaningful SQL depends heavily on the structure of the data you want to query. An LLM can’t generate meaningful SQL unless it “knows” what the underlying data structures are.
Cortex Analyst fills this “knowledge gap” with its own Semantic Model. This is basically a YAML file that defines the data and relationships in your model in business terms. Once you have a dimensional data model in place, it is fairly straightforward to write a Semantic Model for it that conforms to the model specification. The first time you do this, it is probably best to do it by hand, using a simplified data model, to gain an understanding of how the Semantic Model should be specified. Snowflake has also made available a tool to help with creating the Semantic Model. In the future, I suspect that Semantic Model creation will be one of those things that is semi-automated as part of creating objects within a Snowflake schema; it really is just an enhanced, annotated version of metadata that already exists.
To test this out, I followed the basic walkthrough here. This creates a Streamlit app you can use to make natural-language queries of your data.
Once I had established the basics and got the project working, I decided to try it out against my own set of data.
Statistics Canada makes a large number of data sets available for free public use on a wide variety of subjects. I chose a small subset of data regarding international visitors arriving in or returning to Canada by air. This data set categorizes travellers by type and country of residence, and counts arrivals at a number of Canadian airports.
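For context, here is roughly what the target table looks like, along with how the extract might be loaded. This is a sketch only: the column names match the semantic model excerpt that follows, but the stage name and file name are placeholders for wherever the CSV actually lands.
CREATE OR REPLACE TABLE cortex_analyst_demo.revenue_timeseries.stats_can_travellers (
    geo                  VARCHAR,  -- airport / port of entry
    date                 DATE,     -- arrival date
    country_of_residence VARCHAR,
    traveller_type       VARCHAR,
    number_of_travellers NUMBER
);
-- @raw_stage and the file name are placeholders for your own stage and extract.
COPY INTO cortex_analyst_demo.revenue_timeseries.stats_can_travellers
FROM @raw_stage/statcan_air_arrivals.csv
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);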
An excerpt of the Semantic Model is shown below (the dots represent sections of the YAML I have left out of this excerpt to keep the length manageable):
dimensions:
  - name: Geo
    expr: GEO
    description: Geographic location of arrival
    data_type: varchar
    synonyms: ["arrival location", "port of entry"]
.
.
.
  - name: Country_of_residence
    expr: COUNTRY_OF_RESIDENCE
    description: Country traveller arrived from
    data_type: varchar
    synonyms: ["country"]
.
.
.
  - name: Traveller_Type
    expr: TRAVELLER_TYPE
    description: Type of traveller
    data_type: varchar
    synonyms: ["visitor type"]
.
.
.
measures:
  - name: number_of_travellers
    expr: number_of_travellers
    description: travellers for a date, country of residence, and geo
    synonyms: ["arrivals", "visitors"]
Note the use of synonyms. These are alternative names for your measures and dimensions that you can use in your natural-language queries. In this example, we can use “arrival location” or “port of entry” interchangeably to refer to the column GEO in our data set. (You can even see this reflected in the second query below, where the generated SQL aliases geo as port_of_entry.)
So let’s try it! After loading my data into the same schema I created during the walkthrough, and running the app, I asked my new data set some basic English-language questions.
I started with one that seems pretty basic, but which actually has an interesting twist:
First, Cortex Analyst provides an English-language interpretation of the question, which serves as a first-level reality check on what is being asked:
Note the use of the plural “airports”. Cortex Analyst is able to interpret the question as applying not just to “Toronto Airport”, but to any airport with Toronto in its name. This is borne out in the SQL that the enquiry generates:
WITH __stats_can_travellers AS (
SELECT
geo,
date,
number_of_travellers
FROM cortex_analyst_demo.revenue_timeseries.stats_can_travellers
)
SELECT
SUM(number_of_travellers) AS total_arrivals
FROM __stats_can_travellers
WHERE
geo IN ('Toronto - Billy Bishop Toronto City Airport, Ontario', 'Toronto Pearson International Airport, Ontario')
AND date = '2024-11-29'
-- Generated by Cortex Analyst
;
Note the use of the IN clause. Cortex Analyst is smart enough to locate any geo values that have Toronto in them, and use these in the predicate of the query. Here are the results:
Next I asked it to do some basic summaries:
No problem:
WITH __stats_can_travellers AS (
SELECT
geo,
date,
number_of_travellers
FROM cortex_analyst_demo.revenue_timeseries.stats_can_travellers
)
SELECT
geo AS port_of_entry,
SUM(number_of_travellers) AS total_arrivals
FROM __stats_can_travellers
WHERE
date BETWEEN '2024-11-25' AND '2024-11-30'
AND geo IN ('Montreal Trudeau International Airport, Quebec', 'Quebec City Jean Lesage International Airport, Quebec')
GROUP BY
geo
ORDER BY
total_arrivals DESC NULLS LAST
-- Generated by Cortex Analyst
;
Now things got interesting.
I asked what I thought would be a very simple question:
The interpretation of the question looks pretty good:
But the SQL and the results are plainly wrong:
WITH __stats_can_travellers AS (
SELECT
date,
number_of_travellers
FROM cortex_analyst_demo.revenue_timeseries.stats_can_travellers
)
SELECT
AVG(number_of_travellers) AS average_arrivals_per_day
FROM __stats_can_travellers
WHERE
date BETWEEN '2024-11-25' AND '2024-11-30'
-- Generated by Cortex Analyst
;
The data set is small, but there were far more than 125 arrivals per day in this period. We saw in a previous example that there were 3,203 arrivals in Toronto alone on a single day!
Cortex Analyst appears to be averaging number_of_travellers over the individual rows of the stats_can_travellers table, without accounting for the fact that each date repeats: there are rows for the same date across multiple geographies.
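This is easy to confirm with a quick query against the table (my own check, not Cortex Analyst output):
-- Each date appears once per geo (and per other dimension value),
-- so a naive AVG(number_of_travellers) averages rows, not days.
SELECT
    date,
    COUNT(*) AS rows_for_date,
    COUNT(DISTINCT geo) AS distinct_geos
FROM cortex_analyst_demo.revenue_timeseries.stats_can_travellers
WHERE date BETWEEN '2024-11-25' AND '2024-11-30'
GROUP BY date
ORDER BY date;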
So what if I explicitly try to force it to calculate an average?
I asked it to total the number of arrivals between two dates, and then divide that total by the number of days between those dates. This time I got a correct answer.
But some interesting things happened along the way.
First, I got an interpretation of the question that explicitly recognizes I am trying to get an average, even though I did not use the word “average” in my request:
Note that this is the same interpretation as in my first attempt! But in this case, Cortex Analyst generated completely different SQL:
WITH __stats_can_travellers AS (
SELECT
date,
number_of_travellers
FROM cortex_analyst_demo.revenue_timeseries.stats_can_travellers
)
SELECT
SUM(number_of_travellers) / NULLIF(COUNT(DISTINCT date), 0) AS average_arrivals_per_day
FROM __stats_can_travellers
WHERE
date BETWEEN '2024-11-25' AND '2024-11-30'
-- Generated by Cortex Analyst
;
The result column is even given the same name as in my previous attempt:
But now the answer is correct.
Cortex Analyst is still a relatively new product, and I think it shows amazing promise. It will almost certainly become more deeply integrated into the Snowflake user experience, and I suspect that much of the metadata management will become more fully automated.
But even so, care and judgement will need to be exercised before we can depend on the answers it generates. For now, at least, the first step when using it should be a “reality check” on the SQL that is generated, along the lines of the cross-check below.
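For the average-arrivals question above, that reality check might look something like this: compute the per-day totals by hand, average those, and compare with what Cortex Analyst returned (my own hand-written cross-check, not generated SQL):
-- Roll up to one row per day first, then average the daily totals.
WITH daily AS (
    SELECT
        date,
        SUM(number_of_travellers) AS daily_total
    FROM cortex_analyst_demo.revenue_timeseries.stats_can_travellers
    WHERE date BETWEEN '2024-11-25' AND '2024-11-30'
    GROUP BY date
)
SELECT
    AVG(daily_total) AS average_arrivals_per_day
FROM daily;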
That said, I am really looking forward to spending more time with this product, as I can see how it could materially lessen the workload on analysts, and potentially make even sophisticated queries available through natural language.