A truly elastic, scalable cloud data warehouse
Snowflake is a cloud Data warehouse offered as Software-as-a-Service(SaaS) on multiple clouds(AWS and Azure) for analytics workload. This is a cloud service similar to AWS Redshift, Google Big Query(sort of). Lately, I have been hearing a lot of good things about this and a lot of companies moving to this cloud data warehouse. Hence I decided to dig a little deeper into it.
A brief overview of the architecture
As you can see there are 3 major components:
- Cloud services: This component takes care of a variety of services like Authentication, Access control, Metadata management, Infrastructure management, Query Parsing, and optimization.
- Query processing: Snowflake has a concept of
virtual warehouseswhich are like a separate MPP cluster which can be instantiated on demand. They come in different sizes X-small to 4X-Large and snowflake charges based on the size of the virtual warehouse. Each virtual warehouse is an independent entity and does not share any compute resources with other virtual warehouses and the performance of 1 virtual warehouse is not affected by others.
- Database storage: Data stored in snowflake is automatically optimized, compressed into a proprietary columnar format and this is stored in Cloud storage like S3 or Azure blob storage. Snowflake manages how it is stored, where it is stored and is only accessible by SQL queries run using snowflake.
What makes it so cool?
Depending on what your requirements are these can vary but overall I found these things as few the many reasons why so many people are moving towards Snowflake.
- No need for any data warehouse maintenance- setting up your cluster, maintaining it, securing it, take backups and all that jazz!
- Get started easily and quickly.
- Pay as you use, no upfront cost.
- Your data warehouse can scale infinitely(almost) both in terms of storage and compute
- You can scale up or scale down based on your usages
But wait, all these already available in AWS redshift or I can get a similar feature in any other cloud data warehouse solutions.
Automatic Query Optimization
There is no dark magic involved in improving the efficiency of your queries. Based on whom you ask this can be considered as a standout feature or a major hindrance but I am not a fan of tuning queries according to my workload as I feel the way data evolves so quickly in organizations. It becomes more tricky to play catch up to turn the all necessary knobs to make the query faster.
Snowflake claims they tune all the queries “automagically” via a dynamic query optimization engine. No need for any indexes, updating statistics, partition keys or pre-shard data for even distribution when you scale up. All of this will be done by their patent-pending dynamic optimization.
But I still feel snowflake can work on providing the necessary knobs for people who would like to tune their queries.
Multi-cloud support for Database Storage
The underlying file system can either be Amazon S3 or Azure storage. This gives the choice to have our Data warehouse in Azure or AWS. But it also utilizes all the storage, IO throughput guarantees already provided by these cloud vendors. All of this data is encrypted, compressed and distributed to optimize performance.
These are typically a cluster of compute resources in Snowflake. These are stateless compute nodes. Since storage is decoupled from compute, the virtual warehouse is a bunch of worker nodes with no state, although they do have cache that store some data to improve query performance but the actual data is never stored in these nodes. Virtual Warehouse essentially is a set of worker nodes that we can pick, choose and scale as per the workload that would just execute the query on demand.
Another big advantage is that multiple Virtual Warehouses can be run on the same data that is stored on the cloud storage.
Auto-suspension and Auto-resumption
Snowflake automatically suspends the warehouse if it is inactive for the specified period of time. This is enabled by default and suspends automatically after 10 minutes of inactivity.
Auto-resumption is the feature that automatically resumes the warehouse when any statement that requires a warehouse is submitted and the warehouse is the current warehouse for the session. This is enabled by default.
You can control these things when you create your virtual warehouse.
CREATE WAREHOUSE MY_DATAWAREHOUSE_001
WITH WAREHOUSE_SIZE = ‘SMALL’
WAREHOUSE_TYPE = ‘STANDARD’
AUTO_SUSPEND = 1800 //600 is default
AUTO_RESUME = TRUE;
There are other interesting things like Micro-partitions & Data Clustering which allow users to tune some queries to work around the fact that there are no indexes or partitions. Also, there is Snowpipe which is Snowflake’s continuous data ingestion service which is an alternative for bulk loads.
We can delve into those aspects some other day.
There are a lot of other cool features in Snowflake that makes it to be considered for your next cloud data warehouse. Also, they offer 400$ credit if you are planning to do some POC.
Thanks for reading! Please do share the article, if you liked it. Any comments or suggestions are welcome! Check out my other articles here.