Writing Snowflake queries in practice

Andrii Loievets
Conductor R&D
Aug 23, 2021


At Conductor, we deal with pretty big data sets and provide users with a rich set of data views, filtering, and aggregations — all in real-time!

Snowflake is a modern data warehouse that we use to power some of the core features of the Conductor platform, such as Domain and Page Explorer, which analyze millions of keywords per user request. Compared to other tools, Snowflake satisfies our requirements for heavy OLAP workloads and offers a flexible pricing model. Another benefit of Snowflake is that queries are written in SQL, with some Snowflake-specific extensions, so most developers are already comfortable with it.

Every technology has its limitations, pros, and cons. While implementing features for the Conductor platform, we came across a number of tricky Snowflake queries. In this article I'd like to share these insights with anyone considering Snowflake for their projects. Let's go!

Sample dataset

The examples below will be based on the following table:
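The original table definition didn't survive in this copy of the article. Based on the queries discussed below, a minimal sketch might look like this (the table name and column names are my assumptions, not the original schema):

```sql
-- Hypothetical sample table; the name and columns are assumptions
-- reconstructed from the queries discussed below.
CREATE TABLE keywords (
    keyword          VARCHAR,  -- full search phrase, e.g. 'real estate agent'
    tokens           ARRAY,    -- the keyword pre-split into words
    modeled_traffic  NUMBER    -- estimated traffic for the keyword
);
```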

For real-world Snowflake tables you should also define a clustering key, but for simplicity I omitted these details.

Now let’s walk through the examples.

Task: find records where keywords contain a target word.

The first solution you might think of could look like this:
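The query itself is missing from this copy; a plain substring match of the kind described would look something like this (column names assumed):

```sql
-- Naive approach: substring match with LIKE.
-- This also matches words that merely contain 'real',
-- such as 'realtor' or 'surreal'.
SELECT keyword
FROM keywords
WHERE keyword LIKE '%real%';
```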

The results are obviously not what we need, since we get phrases where 'real' is only part of another word:

It is worth mentioning that Snowflake can do basic data transformations while loading data into a table: for example, it can parse a CSV file, recognize comma-separated records as separate columns, handle empty values, and so on. In our case, we had already pre-processed the keywords by splitting them into tokens.
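With the keywords pre-tokenized into an ARRAY column, the word-level match can be expressed with Snowflake's ARRAY_CONTAINS function (a sketch with assumed names; note the explicit cast to VARIANT, which ARRAY_CONTAINS expects):

```sql
-- Exact word match against the pre-tokenized column:
-- matches 'real estate agent' but not 'realtor fees'.
SELECT keyword
FROM keywords
WHERE ARRAY_CONTAINS('real'::VARIANT, tokens);
```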

Much better now!

Flattening arrays

Task: find out which words are the most popular in keywords and count the number of keywords containing these tokens.
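The query for this task isn't preserved here; in Snowflake this is typically done with the LATERAL FLATTEN table function, which unnests an array into one row per element (a sketch with assumed names):

```sql
-- Unnest each keyword's tokens into separate rows,
-- then count how many keywords contain each token.
SELECT t.value::VARCHAR AS token,
       COUNT(*)         AS keyword_count
FROM keywords,
     LATERAL FLATTEN(input => tokens) t
GROUP BY token
ORDER BY keyword_count DESC;
```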

The results will look like this:

Window functions

Task: retrieve all records and calculate total modeled traffic for each keyword using only one query.
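The original query is missing from this copy. A window function lets you attach an aggregate to every row without collapsing the result set, so one way to express this task is (a sketch with assumed names):

```sql
-- SUM(...) OVER () computes the total across all rows
-- and repeats it on every row, so each keyword's record
-- carries the overall modeled traffic alongside its own.
SELECT keyword,
       modeled_traffic,
       SUM(modeled_traffic) OVER () AS total_modeled_traffic
FROM keywords;
```

An empty OVER () makes the whole result set a single window; adding PARTITION BY would compute the total per group instead.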

Query profiling

  • a breakdown of the query into execution steps
  • how much time each step takes
  • how many records are returned at each step
  • how much data was scanned from disk and from the cache
  • how many partitions were scanned (consider this when designing your clustering key)

Typically query profiling UI looks like this:

From this screenshot, we can see that 64% of the data was scanned from the cache, which is pretty good. We can also see that the heaviest operation is the JOIN, taking 50% of the execution time, which makes it the first thing to consider when trying to optimize the query.

Things that profiling can teach you

  1. Use clustering keys in JOIN and WHERE clauses. Avoid filtering on non-clustered columns as much as possible; otherwise your queries will spend the majority of their time on table scan operations.
  2. Forget about SELECT * queries. Snowflake is a columnar data store, so explicitly select only the columns you need.
  3. Apply filters before joins, and apply them in order of cardinality: filter first by columns with the smallest range of distinct values.
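To make these rules concrete, here is an illustrative query shape (the domains table, the domain_id column, and the values are hypothetical, invented purely for this example):

```sql
-- Illustrative only: explicit columns instead of SELECT *,
-- and a filter on the (hypothetical) clustering key applied
-- before the join rather than after it.
SELECT k.keyword, k.modeled_traffic
FROM (
    SELECT keyword, modeled_traffic, domain_id
    FROM keywords
    WHERE domain_id = 42   -- filter first, on the clustering key
) k
JOIN domains d
  ON d.id = k.domain_id;
```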

Bonus hint: forget about foreign keys and unique constraints. Snowflake lets you declare them, but it validates only NOT NULL constraints.
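You can see this behavior with a quick experiment (the table and values are just for illustration):

```sql
-- Snowflake accepts the UNIQUE declaration but does not enforce it;
-- only the NOT NULL constraint is actually validated.
CREATE TABLE constraint_demo (
    id   NUMBER UNIQUE,
    name VARCHAR NOT NULL
);

INSERT INTO constraint_demo VALUES (1, 'first');
INSERT INTO constraint_demo VALUES (1, 'second');  -- succeeds: duplicate id allowed
INSERT INTO constraint_demo (id) VALUES (2);       -- fails: name is NOT NULL
```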

In Conclusion

Snowflake is a good tool for its class of tasks: it requires minimal maintenance effort, supports SQL syntax, and has a columnar data layout and clustering keys that should be easy to understand for developers familiar with other solutions like Amazon Redshift. I hope these tips, together with the Snowflake technical documentation, will help you avoid the pitfalls and write effective, elegant queries in your projects!
