Experience with Snowflake as a data warehouse

Aseem Bansal · Published in towards-infinity · 4 min read · Jul 2, 2018

We chose Snowflake as our data warehouse around 3 months ago, and our team has been using it throughout. Earlier, the workflow was this: someone wanted some insight from the data, so they created a ticket and asked me to get it for them. The joke even became part of our newsletter.

What if you were a data scientist, with access to some of the richest data imaginable, but you had to ask Aseem to get it for you every time you wanted to look?

Since then that has changed. People no longer have to come to me to get insights from the data. They sometimes ask for help with SQL, but I am no longer the sole person getting insights from the data for everyone. It looks to me like choosing Snowflake as our data warehouse was the correct decision. Getting engineering out of the middle of every data question was one of my main goals when choosing a data warehouse.

If you want to see how I chose our data warehouse, take a look at this earlier post.

I would like to share the difference between the before and the after, what made the after easier, and some things that have been a somewhat mixed bag.

Before Snowflake, I used Apache Spark to get information out of the data. I spun up an EMR cluster, connected to a Zeppelin notebook and started playing around with the data. Once I had the answers, I passed them back. Now anyone with an account logs into Snowflake’s UI, writes SQL and gets the answers they are looking for. People who work on the front end use it to get answers. Sometimes product people try it out. None of these people were using Apache Spark earlier; it was too cumbersome, because they had to worry about machines and about the exceptions Spark threw at us.
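To give a feel for the difference, the kind of ad-hoc question people now answer themselves looks roughly like this; the table and column names are made up for the example.

```sql
-- Hypothetical example: signups per day over the last week
SELECT DATE_TRUNC('day', created_at) AS signup_day,
       COUNT(*)                      AS signups
FROM signups
WHERE created_at >= DATEADD('day', -7, CURRENT_DATE())
GROUP BY 1
ORDER BY 1;
```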

Another thing I am noticing is people using Snowflake to make data-based decisions. It goes like this: someone says “Let’s do X.” Someone else says “Why don’t we check the data in Snowflake before we decide whether this is a good idea or not?” Someone goes into Snowflake and checks the data, and the team is better informed to make that decision. That almost never happened earlier. Our decisions are a bit more data-driven now.

The main thing that helped with this change is that we don’t have to worry about machines. We still have ETL and ML model training jobs written in Apache Spark. When we wrote those jobs, we had to decide on machine configurations. What if we want to process more data? Try a different machine configuration. What if that new configuration fails? Try another. I am not saying we cannot do it, but it takes time, and keeping a much bigger cluster running is costly. What about with Snowflake? We have created 4 warehouses in Snowflake: 2 for automated jobs that run on a daily basis, one for small ad-hoc analysis and one for heavy data requirements. We keep it at 4 for budgeting reasons. And if someone has a bigger piece of work to do? They increase the warehouse size and decrease it back after the work is complete. Quite simple.
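That resize is a single statement. A minimal sketch, assuming a warehouse named adhoc_wh (the name is just an example):

```sql
-- Scale the warehouse up for a heavy one-off job
ALTER WAREHOUSE adhoc_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale it back down once the work is done
ALTER WAREHOUSE adhoc_wh SET WAREHOUSE_SIZE = 'XSMALL';
```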

Another thing that is unique to Snowflake is its Variant data type. Our schema is not fixed. We have integrations with 3rd parties that send us data. Some of it has a fixed schema, some of it does not, and sometimes they make mistakes and go against the schema. We need to be able to store and analyze all of it, and the Variant data type has been helpful for that. A rough sketch of that usage is below, and it is also one of the places where we had some issues, listed after the sketch.
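This is only an illustration of the general pattern; the table, column and field names are hypothetical.

```sql
-- Hypothetical table holding raw 3rd-party payloads as VARIANT
CREATE TABLE IF NOT EXISTS raw_events (
    received_at TIMESTAMP_NTZ,
    payload     VARIANT
);

-- Load arbitrary JSON without declaring its schema up front
INSERT INTO raw_events
SELECT CURRENT_TIMESTAMP(), PARSE_JSON('{"type": "signup", "user": {"id": 42}}');

-- Query nested fields straight off the VARIANT column
SELECT payload:type::STRING    AS event_type,
       payload:user.id::NUMBER AS user_id
FROM raw_events;
```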

  • Snowflake queries don’t have to scan all the data, thanks to the metadata it keeps, which makes them fast and cheap. With the Variant data type, though, it was doing full table scans earlier, which made queries costly and slow. If someone wanted to look at the data for a day or an hour, where clauses filtered the data but did not help with the query execution time, and thus the cost. So I created separate columns containing the date and time, which make filtering the data cheaper (see the first sketch after this list). It seems there are limitations in the metadata used for Variant columns.
  • When we used a limit clause together with where clauses over the Variant data type, some queries did a full table scan and some did not, and I am not sure what exactly caused that. We started using with clauses to bypass the problem: the with clauses had the limit clauses but no where clauses over the Variant data type (see the second sketch after this list).
  • The Variant data type is schema-less. That is great until someone wants to know what the schema looks like. One way to get some idea of the varying schema is to run select queries with limits and look at some rows. Another way is to use Spark to read the data and call printSchema() on the Spark dataframe. Needing to know the complete schema is a very rare need, so that is OK, though not ideal. I wish there was something like variety for Mongo which could give us the schema. As Snowflake is SaaS, I hope they add something like that.
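
The first sketch continues the hypothetical raw_events table from above: keep explicit date/time columns alongside the Variant payload and filter on those instead of digging into the payload.

```sql
-- Hypothetical: a plain DATE column stored next to the VARIANT payload
ALTER TABLE raw_events ADD COLUMN event_date DATE;

-- Costly: filtering on a value inside the VARIANT payload
SELECT payload FROM raw_events
WHERE payload:created_at::DATE = '2018-06-30';

-- Cheaper: filtering on the plain DATE column
SELECT payload FROM raw_events
WHERE event_date = '2018-06-30';
```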

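The second sketch shows the shape of the with-clause workaround as I read it: the inner query takes a limited slice without any where clause over the Variant column, and the Variant filter is applied on that smaller result. The limit runs before the filter, so it trades exactness for speed, which is fine for ad-hoc exploration.

```sql
-- Hypothetical: limit first in the with clause, filter the VARIANT afterwards
WITH sample_rows AS (
    SELECT payload
    FROM raw_events
    LIMIT 1000
)
SELECT payload
FROM sample_rows
WHERE payload:type::STRING = 'signup';
```
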
All in all, I would say it has been a positive experience with Snowflake. If you have read this far, do share whether there was anything you liked about Snowflake. I would love to hear your experiences.

Created by Aseem Bansal. If you want to know when I write more articles, or want to send me a message, click here for the details.
