Writing Snowflake queries in practice
At Conductor, we deal with pretty big data sets and provide users with a rich set of data views, filtering, and aggregations — all in real time!
Snowflake is a modern data warehouse that we use to power some of the core features of the Conductor platform, such as Domain Explorer and Page Explorer, which analyze millions of keywords per user request. Compared to other tools, Snowflake satisfies our requirements for heavy OLAP workloads and provides a flexible pricing model. Another benefit of Snowflake is that queries are written in SQL with some Snowflake-specific extensions, which developers should be comfortable with.
Every technology has its limitations, pros, and cons. While implementing features for the Conductor platform, we came across a number of tricky Snowflake queries. In this article I would like to share these insights with those who would like to use Snowflake in their own projects. Let’s go!
The examples below will be based on the following table:
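The original table definition is not shown here, so the sketch below is an assumption: the column names (`keyword`, `tokens`, `domain_id`, `modeled_traffic`) are illustrative, with `tokens` holding the keyword pre-split into individual words.

```sql
-- Hypothetical schema for the examples that follow.
CREATE TABLE keywords (
    keyword         VARCHAR,  -- full search phrase, e.g. 'real estate agent'
    tokens          ARRAY,    -- the phrase split into words, e.g. ['real', 'estate', 'agent']
    domain_id       NUMBER,   -- the domain this keyword was collected for
    modeled_traffic NUMBER    -- estimated traffic attributed to this keyword
);
```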
For real-world Snowflake tables you should also define a clustering key, but for simplicity I omitted these details.
Now let’s walk through the examples.
Task: find records where keywords contain a target word.
The first solution you might think of could look like this:
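The original query is not shown, but a naive attempt would presumably be a plain substring match with `LIKE` (the table and column names here follow the assumed schema above):

```sql
-- Naive substring match: also matches 'real' inside longer words.
SELECT keyword
FROM keywords
WHERE keyword LIKE '%real%';
```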
The results are obviously not what we need: we get phrases where ‘real’ is only part of a longer word.
It is worth mentioning that Snowflake can do basic data transformations while loading data into a table — for example, parsing a CSV file, recognizing comma-separated records as separate columns, handling empty values, etc. In our case, we had already pre-processed the keywords by splitting them into tokens.
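With the keywords pre-tokenized into an ARRAY column, the whole-word match can be sketched with Snowflake's `ARRAY_CONTAINS` function (the `tokens` column is an assumption from the schema above):

```sql
-- Whole-word match against the pre-split tokens.
-- ARRAY_CONTAINS takes a VARIANT value and an ARRAY.
SELECT keyword
FROM keywords
WHERE ARRAY_CONTAINS('real'::VARIANT, tokens);
```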
Much better now!
Task: find out which words are the most popular in keywords and count the number of keywords containing these tokens.
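The original query is not included, but this kind of token popularity count is typically done in Snowflake with `LATERAL FLATTEN`, which expands each array element into its own row. A sketch under the assumed schema:

```sql
-- Unnest each tokens array into one row per word, then count
-- how many distinct keywords contain each word.
SELECT t.value::VARCHAR      AS token,
       COUNT(DISTINCT keyword) AS keyword_count
FROM keywords,
     LATERAL FLATTEN(input => tokens) t
GROUP BY token
ORDER BY keyword_count DESC
LIMIT 10;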
The results will list each token alongside the number of keywords that contain it.
Task: retrieve all records and calculate total modeled traffic for each keyword using only one query.
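The author's query is not shown; the usual way to return every row while also computing a per-keyword total in a single query is a window aggregate. A sketch, assuming the table holds several rows per keyword (e.g., one per ranking URL, a hypothetical `url` column):

```sql
-- Window SUM: every row is returned, and each row also carries
-- the total modeled traffic across all rows of its keyword.
SELECT keyword,
       url,
       modeled_traffic,
       SUM(modeled_traffic) OVER (PARTITION BY keyword) AS total_modeled_traffic
FROM keywords;
```

The window function avoids a second pass with a `GROUP BY` subquery joined back to the table.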
The query profile shows:
- the breakdown of a query into execution steps
- how long each step takes
- how many records are returned at each step
- how much data was scanned from disk and from the cache
- how many partitions were scanned (consider this when designing your clustering key)
Typically, the query profiling UI looks like this:
From this screenshot, we can see that 64% of the data was scanned from cache, which is pretty good. Also, the heaviest operation is the JOIN; it takes 50% of the execution time, so it is the first thing to consider when trying to optimize the query.
Things that profiling can teach you
- Use clustering keys in JOIN and WHERE clauses. Avoid non-clustered columns as much as possible; otherwise your queries will spend the majority of their time on table scan operations.
- Forget about SELECT * queries. Snowflake is a columnar data store; explicitly select only the columns you need.
- Apply filters before joins, and apply them in order of cardinality: filter first on columns with a smaller range of distinct values.
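The filter-before-join advice can be sketched as follows; the `domains` table and the specific filter value here are hypothetical, included only to show the shape of the query:

```sql
-- Push the filter into a subquery so the join operates
-- on an already-reduced row set.
WITH filtered_keywords AS (
    SELECT keyword, domain_id, modeled_traffic
    FROM keywords
    WHERE domain_id = 42   -- low-cardinality filter applied first
)
SELECT f.keyword,
       d.domain_name,
       f.modeled_traffic
FROM filtered_keywords f
JOIN domains d
  ON d.domain_id = f.domain_id;
```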
Bonus hint: forget about foreign keys and unique constraints — Snowflake enforces only NOT NULL constraints.
Snowflake is a good tool for its class of tasks: it requires minimal maintenance effort, supports SQL syntax, and has a columnar data layout and clustering keys that should be easy to understand for developers familiar with similar solutions such as Amazon Redshift. I hope these tips, together with the Snowflake technical documentation, will help you get past the pitfalls and write effective, elegant queries in your projects!