Testing query optimization tips in Snowflake
Let’s see how Snowflake handles some easy examples
I’ve heard and read about several “tips” for writing queries in columnar databases like Snowflake or BigQuery.
Some tips are about readability or overall code quality; others relate to query optimization and how columnar databases shuffle data.
I wanted to test a few of these optimization techniques to see how different databases’ query optimizers handle them. These are very simple toy examples — the results could differ given a more complex query.
In this post, I’ll show you the results I got with Snowflake.
In the next post, we’ll look at how Google’s BigQuery handles similar tasks.
These examples use public municipality data available in Finland. The data represents all purchases made by a city in a given time frame. The dataset holds roughly 1.4M rows with 60ish columns.
I also tell Snowflake not to use cached results, since that would defeat the whole purpose of this exercise.
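For reference, turning the result cache off comes down to a single session parameter:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
-- Turn the result cache back on once the benchmarking is done:
-- ALTER SESSION SET USE_CACHED_RESULT = TRUE;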
In all of the examples we do roughly the same thing: we take the average value of a purchase over a few columns. Then we join the average back to the original data and check whether the value of that row is bigger than the average of the group.
We need to do some casting because Finnish numbers use a decimal comma rather than a decimal point. I have also changed the column names from Finnish to English to make the examples a bit easier to understand.
CTEs vs. subqueries
I love CTEs. They make complex SQL so much more readable than an endless chain of nested subqueries.
There’s even a really popular blog post that seems to indicate that CTEs are more performant than subqueries. Even GitLab refers to this blog post in their SQL Style Guide while advocating the use of CTEs.
Naturally, this will be our first test.
A single CTE and subquery
The first version utilizes a single CTE to calculate the average:
WITH average AS (
SELECT
MUNICIPALITY_ID,
AVG(CAST(REPLACE(EUR, ',', '.') AS NUMERIC)) AS AVG_EUR
FROM
F_PURCHASES
GROUP BY
1
)
SELECT
A.RECEIPT_ID,
A.MUNICIPALITY_ID,
IFF(
CAST(REPLACE(A.EUR, ',', '.') AS NUMERIC) > B.AVG_EUR,
TRUE,
FALSE
) AS IS_ABOVE_AVG
FROM
F_PURCHASES AS A
LEFT JOIN
average AS B
ON
A.MUNICIPALITY_ID = B.MUNICIPALITY_ID;
The second version utilizes a subquery:
SELECT
A.RECEIPT_ID,
A.MUNICIPALITY_ID,
IFF(
CAST(REPLACE(A.EUR, ',', '.') AS NUMERIC) > B.AVG_EUR,
TRUE,
FALSE
) AS IS_ABOVE_AVG
FROM
F_PURCHASES AS A
LEFT JOIN (
SELECT
MUNICIPALITY_ID,
AVG(CAST(REPLACE(EUR, ',', '.') AS NUMERIC)) AS AVG_EUR
FROM
F_PURCHASES
GROUP BY
1
) AS B
ON
A.MUNICIPALITY_ID = B.MUNICIPALITY_ID;
The results are very similar:
| Query | Total Execution Time | Bytes Scanned |
| -------- | -------------------- | ------------- |
| CTE | 577 ms | 5.39 MB |
| Subquery | 565 ms | 5.39 MB |
Both the execution time and the bytes scanned are practically identical, so we can conclude that the two versions behave roughly the same.
Multiple CTEs and Subqueries
Let’s make this test a bit more complex: instead of joining the aggregate once, we join it twice.
The first version still utilizes a single CTE but joins it twice:
WITH average AS (
SELECT
MUNICIPALITY_ID,
AVG(CAST(REPLACE(EUR, ',', '.') AS NUMERIC)) AS AVG_EUR
FROM
F_PURCHASES
GROUP BY
1
)
SELECT
A.RECEIPT_ID,
A.MUNICIPALITY_ID,
IFF(
CAST(REPLACE(A.EUR, ',', '.') AS NUMERIC) > B.AVG_EUR,
TRUE,
FALSE
) AS IS_ABOVE_AVG,
IFF(
CAST(REPLACE(A.EUR, ',', '.') AS NUMERIC) < C.AVG_EUR,
TRUE,
FALSE
) AS IS_UNDER_AVG
FROM
F_PURCHASES AS A
LEFT JOIN
average AS B
ON
A.MUNICIPALITY_ID = B.MUNICIPALITY_ID
LEFT JOIN
average AS C
ON
A.MUNICIPALITY_ID = C.MUNICIPALITY_ID;
The second joins a subquery twice:
SELECT
A.RECEIPT_ID,
A.MUNICIPALITY_ID,
IFF(
CAST(REPLACE(A.EUR, ',', '.') AS NUMERIC) > B.AVG_EUR,
TRUE,
FALSE
) AS IS_ABOVE_AVG,
IFF(
CAST(REPLACE(A.EUR, ',', '.') AS NUMERIC) < C.AVG_EUR,
TRUE,
FALSE
) AS IS_UNDER_AVG
FROM
F_PURCHASES AS A
LEFT JOIN (
SELECT
MUNICIPALITY_ID,
AVG(CAST(REPLACE(EUR, ',', '.') AS NUMERIC)) AS AVG_EUR
FROM
F_PURCHASES
GROUP BY
1
) AS B
ON
A.MUNICIPALITY_ID = B.MUNICIPALITY_ID
LEFT JOIN (
SELECT
MUNICIPALITY_ID,
AVG(CAST(REPLACE(EUR, ',', '.') AS NUMERIC)) AS AVG_EUR
FROM
F_PURCHASES
GROUP BY
1
) AS C
ON
A.MUNICIPALITY_ID = C.MUNICIPALITY_ID;
This time around, we notice some differences in the key metrics:
| Query | Total Execution Time | Bytes Scanned |
| ------------ | -------------------- | ------------- |
| 2 CTEs | 628 ms | 5.39 MB |
| 2 Subqueries | 722 ms | 8.08 MB |
The differences in query times are minimal and probably due to variance. What stands out is the “Bytes Scanned” metric.
The version with two subqueries scans way more bytes. This is most likely because Snowflake recognizes that we are utilizing the same CTE twice and only processes it once, whereas it does not recognize that the two subqueries are identical.
The Snowflake UI’s query profile shows that the subquery version has three table scans, while the CTE version has only two. Looking more closely at the table scan nodes, one notices that the subqueries are indeed processed separately.
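If you prefer not to click through the UI, the same metrics can also be pulled from Snowflake’s query history, for example with the INFORMATION_SCHEMA table function. A minimal sketch; filter it however suits you:
SELECT
QUERY_TEXT,
TOTAL_ELAPSED_TIME, -- in milliseconds
BYTES_SCANNED
FROM
TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 10))
ORDER BY
START_TIME DESC;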
SELECT * vs. SELECT some
The conventional wisdom is that, because of how columnar databases store data, selecting more columns than are needed to complete the query will be inefficient.
In this test, the first version selects all columns:
WITH more_than_needed AS (
SELECT
*
FROM
F_PURCHASES
),
average AS (
SELECT
MUNICIPALITY_ID,
AVG(CAST(REPLACE(EUR, ',', '.') AS NUMERIC)) AS AVG_EUR
FROM
more_than_needed
GROUP BY
1
)
SELECT
A.RECEIPT_ID,
A.MUNICIPALITY_ID,
IFF(
CAST(REPLACE(A.EUR, ',', '.') AS NUMERIC) > B.AVG_EUR,
TRUE,
FALSE
) AS IS_ABOVE_AVG
FROM
more_than_needed AS A
LEFT JOIN
average AS B
ON
A.MUNICIPALITY_ID = B.MUNICIPALITY_ID;
The second version selects only those columns that are needed:
WITH what_is_needed AS (
SELECT
MUNICIPALITY_ID,
RECEIPT_ID,
EUR
FROM
F_PURCHASES
),
average AS (
SELECT
MUNICIPALITY_ID,
AVG(CAST(REPLACE(EUR, ',', '.') AS NUMERIC)) AS AVG_EUR
FROM
what_is_needed
GROUP BY
1
)
SELECT
A.RECEIPT_ID,
A.MUNICIPALITY_ID,
IFF(
CAST(REPLACE(A.EUR, ',', '.') AS NUMERIC) > B.AVG_EUR,
TRUE,
FALSE
) AS IS_ABOVE_AVG
FROM
what_is_needed AS A
LEFT JOIN
average AS B
ON
A.MUNICIPALITY_ID = B.MUNICIPALITY_ID;
The results are once again practically identical:
| Query | Total Execution Time | Bytes Scanned |
| ------------ | -------------------- | ------------- |
| SELECT * | 1500 ms | 2.71 MB |
| SELECT some | 1500 ms | 2.71 MB |
Again, the total execution times are identical. The number of bytes scanned is also identical, suggesting that the two queries do roughly the same work.
This is most likely because the final SELECT statement only takes a handful of columns: it is easy for the query optimizer to simply skip the unnecessary columns included in the CTE.
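If you want to sanity-check this yourself, one option is to ask Snowflake for the query plan with EXPLAIN and look at which expressions the table scan actually reads. A rough sketch, not the exact output I looked at:
EXPLAIN
WITH more_than_needed AS (
SELECT
*
FROM
F_PURCHASES
)
SELECT
RECEIPT_ID,
MUNICIPALITY_ID
FROM
more_than_needed;
-- In the plan, the expressions listed for the TableScan operation should only
-- include the columns the outer query actually touches, not all 60ish columns.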
Of course, this does not mean that selecting unnecessary columns never matters. Processing unneeded columns is still inefficient in a columnar database, and explicitly stating the columns you use is also a boost to readability.
Funnily enough, even though these queries are logically identical to the CTE vs Subquery example, the number of bytes scanned has dropped significantly.
Window functions everywhere vs. GROUP BY
This last one isn’t exactly a “best practice” I have heard about, but I have stumbled across a query similar to this one in my career.
The codebase I was working with had the following query where window functions were used quite interestingly:
WITH BASE_ AS (
SELECT
MUNICIPALITY_ID,
COMPANY_ID,
CAST(REPLACE(EUR, ',', '.') AS NUMERIC) AS EUR,
CAST(
REPLACE(TOTAL_EUR, ',', '.') AS NUMERIC
) AS TOTAL_EUR
FROM
F_PURCHASES
)
SELECT DISTINCT
MUNICIPALITY_ID,
COMPANY_ID,
SUM(EUR) OVER(
PARTITION BY
MUNICIPALITY_ID,
COMPANY_ID
) AS EUR_SUM,
AVG(TOTAL_EUR) OVER(
PARTITION BY
MUNICIPALITY_ID,
COMPANY_ID
) AS TOTAL_EUR_AVG
FROM
BASE_;
I really do not know the story behind this query, and a lot of it does not make much sense to me. Maybe it was written this way to tackle duplicate rows? But even then the calculation would be off, because the window functions would still include those duplicates.
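To illustrate that point, here is a toy example with made-up values: the duplicated row is counted twice by the window sum, and DISTINCT only removes the duplicate output row after the sum has already been computed:
WITH dupes AS (
SELECT * FROM (
VALUES
(1, 'A', 10),
(1, 'A', 10) -- an accidental duplicate row
) AS t (MUNICIPALITY_ID, COMPANY_ID, EUR)
)
SELECT DISTINCT
MUNICIPALITY_ID,
COMPANY_ID,
SUM(EUR) OVER (
PARTITION BY MUNICIPALITY_ID, COMPANY_ID
) AS EUR_SUM -- 20, not 10
FROM
dupes;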
Anyway, we were basically utilizing window functions where a simple GROUP BY would yield the same result:
WITH BASE_ AS (
SELECT
MUNICIPALITY_ID,
COMPANY_ID,
CAST(REPLACE(EUR, ',', '.') AS NUMERIC) AS EUR,
CAST(
REPLACE(TOTAL_EUR, ',', '.') AS NUMERIC
) AS TOTAL_EUR
FROM
F_PURCHASES
)
SELECT
MUNICIPALITY_ID,
COMPANY_ID,
SUM(EUR) AS EUR_SUM,
AVG(TOTAL_EUR) AS TOTAL_EUR_AVG
FROM
BASE_
GROUP BY
1, 2;
The metrics show that the performance of the GROUP BY is way better:
| Query | Total Execution Time | Bytes Scanned |
| --------- | -------------------- | ------------- |
| WINDOW | 655 ms | 6.51 MB |
| GROUP BY | 286 ms | 6.51 MB |
The difference in execution time is so noticeable that I wouldn’t pin it on variance.
The real-life example this was copied from calculated dozens of metrics, which likely makes the problem even worse. As a bonus, using a GROUP BY makes the code a lot more concise and easier to work with.
Key takeaways
This limited test showed some interesting properties that are useful to keep in mind:
- CTEs are not necessarily more efficient than subqueries. I will still take CTEs over subqueries any day: they can be more efficient, and they make the code much more readable, which is one of the most important aspects of code.
- CTEs are “cached” whereas identical subqueries are processed separately, as if each were unique.
- With simple queries like these, using “SELECT *” won’t negatively affect your query performance.
- GROUP BY beats window functions in execution speed.
The big asterisk to all of the above points is that the examples were simple enough for Snowflake’s query optimizer to handle them efficiently. This would most likely change in a real work environment where transformations can be a combination of dozens of complex steps over thousands of lines of SQL.
Stay tuned for the next part of the series, where I will take BigQuery on a similar journey! In the meantime, I would like to hear about your experiences optimizing Snowflake queries.