Snowflake: a gentle introduction to its data warehouse and data lake offerings

Paolo Cristofanelli
Data Reply IT | DataTech
9 min read · Sep 30, 2021

This post aims to give you a quick understanding of Snowflake and help you decide whether it can be a good solution for your data architecture.

First things first: what is Snowflake?

Snowflake is primarily a data warehouse hosted in a cloud environment (AWS, Azure, and GCP flavors are available). It is fully managed and scales virtually without limits. This article will also explore why you should consider using Snowflake as your data lake.

What can it do as a data warehouse? Quite a lot, actually. Here are the main features:

  • You can query your data with SQL. JSON support is particularly good: Snowflake uses a special data type called VARIANT (e.g. v:time::timestamp accesses a nested timestamp value in a JSON object; see the sketch after this list).
  • You can define JavaScript User-Defined Functions (UDFs) and Stored Procedures to work with your data. Recently, Java UDFs have been introduced.
  • Joins, ACID transactions, analytical aggregations, windowing, and hierarchical queries are all supported.
  • It is possible to run time-travel queries, i.e. query or restore a table's state as of a given point in time, so if something goes wrong you can easily recover the original data. You can also audit table changes.
  • Native support for loading data in near-real-time fashion (using Snowpipe)
  • Highly available and scalable
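
To make the bullets above more concrete, here is a minimal sketch of the VARIANT syntax, a JavaScript UDF, and a time-travel query. The table and function names (json_events, double_it) are illustrative, not taken from any official tutorial:

-- access nested fields inside a VARIANT column v and cast them
select v:time::timestamp as event_time,
       v:city.name::string as city
from json_events
where v:city.name::string = 'New York';

-- a trivial JavaScript UDF (argument names are referenced in uppercase inside the JS body)
create or replace function double_it(x float)
returns float
language javascript
as 'return X * 2;';

-- time travel: query the table as it was one hour ago
select * from json_events at(offset => -60*60);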

There are a lot of data warehousing solutions, why Snowflake?

Snowflake shines when it comes to simplicity of setup and management, while offering high performance and a ton of features.

Through its intuitive Web UI, it is possible to register, spin up a cluster, load your data and perform analysis in a matter of minutes.

Snowflake Web UI showing steps of the Snowflake tutorial

To load data, all you need to do is spin up a Snowflake Virtual Warehouse (VW) with a few clicks and minimal configuration, and then load data into Snowflake by issuing a copy statement.

Loading data from an S3 bucket into Snowflake:

create stage nyc_weather
  url = 's3://snowflake-workshop-lab/weather-nyc';

copy into json_weather_data
  from @nyc_weather
  file_format = (type = json);

You do not need to worry about anything else: indexing, performance tuning, partitioning, physical storage design, compression… it is all done for you by Snowflake.

Snowflake also shines when it comes to integrations: you can interact with Snowflake data using a wide range of tools, including BI tools (Tableau, Spotfire…), data processing tools (Kafka, Spark, Databricks…), JDBC/ODBC drivers, and data integration tools (Informatica, Talend…).

Would you like to build an ML model on your Snowflake data with Databricks? No problem: from your Databricks notebook you can easily read from a Snowflake table, do your processing, and write back into a Snowflake table (https://docs.databricks.com/data/data-sources/snowflake.html).

In Snowflake there is no need to provision a cluster with enough shared resources for both storage and compute, risking over- or under-provisioning one of the two. Here, you can scale storage and compute resources separately, and consequently also their cost. You can also minimize costs by giving each workload its own tailored VW. Then, if your data grows and you need a bigger VW, you can simply resize the VW or enable the autoscaling feature.

You pay just for what you use: you are billed for Snowflake VW compute only while you are running queries. Snowflake can detect when your VW is idle and will auto-suspend it, avoiding unnecessary costs. You can also break down the bill per VW, so if each group uses its own set of VWs, you can easily find out how much each group costs.
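
As a rough sketch of how this looks in practice (the warehouse name and sizes are illustrative, and multi-cluster autoscaling is, to my knowledge, an Enterprise-edition feature):

create warehouse if not exists analytics_wh
  warehouse_size = 'XSMALL'
  auto_suspend = 60        -- suspend after 60 seconds of inactivity
  auto_resume = true       -- resume automatically when a new query arrives
  min_cluster_count = 1
  max_cluster_count = 3;   -- scale out under concurrent load

-- resize later if the workload grows
alter warehouse analytics_wh set warehouse_size = 'MEDIUM';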

https://www.snowflake.com/blog/how-usage-based-pricing-delivers-a-budget-friendly-cloud-data-warehouse/

Another key offering of Snowflake is its support for sharing data by reference. The traditional data-sharing model relies on copying data, and it comes with some important disadvantages:

  • loss of control over the data once it has been exported
  • the cost of storing copies of the data
  • the cost of duplicating the data and keeping it up to date
  • no direct way to monitor how the data is being used

Snowflake adopts another approach: it shares data by reference, which means there is no need to copy data at all. You decide what to share and keep fine-grained access control over the data (you can also revoke permissions whenever you want). You can share not only tables and views (including views that join multiple tables) but also UDFs. Shared data is live and immediately available across multiple regions and clouds.
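
A minimal sketch of sharing by reference, assuming a hypothetical analytics_db database and a consumer account named consumer_account:

-- create a share and grant it access to specific objects
create share weather_share;
grant usage on database analytics_db to share weather_share;
grant usage on schema analytics_db.public to share weather_share;
grant select on table analytics_db.public.json_weather_data to share weather_share;

-- make the share visible to a consumer account; access can be revoked at any time
alter share weather_share add accounts = consumer_account;
-- revoke select on table analytics_db.public.json_weather_data from share weather_share;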

https://resources.snowflake.com/ebooks/oreilly-report-architecting-data-intensive-saas-applications

To help customers understand how their data is being used, Snowflake offers data sharing usage views that provide information on the queries consumers have run. By leveraging these views, Snowflake customers can better understand what their consumers want and adjust their offering.

Even more, through the Snowflake Data Marketplace, data providers can sell access to their data and consumers can buy it. For example, if your processing needs a specific dataset that another organization has produced, you can buy it from the Data Marketplace.

How does it work behind the scenes?

When using Snowflake with AWS (Azure and GCP being the other options), the Snowflake data on the storage layer is hosted in S3. The VW clusters you create to analyze and load the data are made of EC2 instances.

Having S3 as the backbone means high availability and durability, but also high I/O latency, the ability to overwrite objects only as a whole (no in-place updates), and CPU overhead. This is where Snowflake comes into play: its data pre-processing involves a series of steps that enable Snowflake's functionality on top of S3-backed data.

Let’s take a look at how Snowflake pre-processes the data:

  1. The table's data is divided into micro-partitions.
  2. Each micro-partition is reorganized into a columnar layout (thus allowing for columnar compression).
  3. A header is added to each micro-partition with the column and metadata offsets.
  4. Each micro-partition is an immutable file in S3. Snowflake keeps metadata about each micro-partition.
  5. By leveraging S3's ability to read just part of a file, Snowflake can read only the columns you are interested in (column pruning).
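
In practice this means a selective query touches only the micro-partitions and columns it needs. A sketch, assuming a hypothetical events table with an event_date column:

-- only the event_type and event_date columns are read, and per-partition metadata
-- (such as min/max values) lets Snowflake skip irrelevant micro-partitions
select event_type, count(*)
from events
where event_date = '2021-09-01'
group by event_type;

-- micro-partition statistics for a set of columns can be inspected with:
select system$clustering_information('events', '(event_date)');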

At a higher level Snowflake architecture looks like this:

Credits to Snowflake

On top of the data, Snowflake has a services layer that stores the metadata and provides the core Snowflake features, handling security, transactions, and so on.

Then we have the Virtual Warehouses, which are the actual engine of Snowflake: DML and DDL queries are executed by the VWs. One core feature to note is workload isolation: queries executed in one VW do not impact other VWs. As a result, one customer's workload will not impact another's in a multi-tenant architecture.

At the bottom we have the storage layer, composed of Snowflake's pre-processed and compressed data stored in S3.

Using Snowflake as a data lake

Since Snowflake is backed by S3 (when using AWS), which is itself a popular data lake solution, the next step is straightforward: what if I loaded all my data directly into Snowflake?

Let's look at some of the advantages we may gain from using Snowflake as our data lake:

  • Snowflake supports ingesting data from a variety of sources: auto-ingesting new data with Snowpipe, ingesting directly from Kafka through the Snowflake Kafka Connector, generic batch loads (copy statements), or a utility provided by one of Snowflake's ETL partners (a Snowpipe sketch follows this list).
  • You can query data directly with Snowflake SQL, without having to resort to, for example, Athena or Presto. Moreover, Snowflake SQL has better support for querying raw data.
  • It is much easier for analysts to access data in Snowflake than data sitting in S3, thanks to Snowflake's BI integrations. Moreover, Snowflake provides its own visualization UI (Snowsight).
  • You can use Snowflake to centrally manage the authorizations for all your data.
  • Data in Snowflake is stored in a compressed, columnar format, so performance is far higher than with traditional data lakes.
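
As an example of the first point, here is a rough sketch of a Snowpipe that auto-ingests JSON files as they land in a bucket. The bucket, stage, pipe, and target table names are made up, and the credentials/storage integration plus the S3 event notification setup required by auto_ingest are omitted:

create stage raw_landing
  url = 's3://my-datalake-landing/events/';

create pipe events_pipe auto_ingest = true as
  copy into raw_events
  from @raw_landing
  file_format = (type = json);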

One common (if somewhat legacy) design for data lakes involves dividing the lake into logical zones, as shown in the following picture.

https://jeremiahhansen.medium.com/beyond-modern-data-architecture-89c8199ac4f9

The design pattern often involves storing the first two zones (raw, conformed) in a data lake and the latter two in a data warehouse. Perhaps one of the most important advantages of Snowflake is that you can use a single platform to store all your layers (since it can act both as a data lake and as a data warehouse).

Snowflake allows you to define data pipelines through streams and tasks, to automatically process your data as it arrives in the raw area. Streams track the changes to a table, while tasks consume those changes and perform some operation on them. For example, a stream is defined over a source table in the raw layer and keeps track of that table's changes; every few minutes, a Snowflake task reads from the stream and updates the corresponding table in the conformed layer accordingly.
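
A minimal sketch of such a pipeline, with illustrative table, stream, task, and warehouse names:

-- a stream on the raw table records its changes
create or replace stream raw_events_stream on table raw_events;

-- a task wakes up every five minutes and, only if the stream has data,
-- appends the new rows to the conformed table
create or replace task refresh_conformed_events
  warehouse = analytics_wh
  schedule = '5 MINUTE'
when system$stream_has_data('raw_events_stream')
as
  insert into conformed_events
  select v:id::string, v:time::timestamp, v
  from raw_events_stream
  where metadata$action = 'INSERT';

-- tasks are created suspended; resume to activate
alter task refresh_conformed_events resume;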

This way, you can update your conformed tables as soon as the data arrives, in an incremental and automated way, all inside Snowflake. To make things even more interesting, you can also perform data processing with the recently announced Snowpark, an Apache Spark-like library.

The major drawback of Snowflake's data lake offering may be a lack of flexibility for some use cases, specifically:

  • Continuous ingestion costs can be high. Creating a Snowpipe stage and keeping a VW running 24/7, even with a small load, can be costly: Snowflake charges by VW size and uptime, not by actual resource utilization.
  • Born as a data warehouse, Snowflake best supports structured and semi-structured data. Despite the recently announced upcoming support for unstructured data, if you have a lot of unstructured data, Snowflake may not be the most adequate choice.
  • Vendor lock-in. Snowflake uses a proprietary format. This provides a lot of advantages, but also means higher ingestion costs and additional complexity should you decide to move data out of Snowflake. Competitors like Databricks have tried to mitigate the problem by making their Delta data format open source (even though some of the format's best features remain private).

Conclusions

Snowflake's data warehouse offering is well established. It offers a wide range of features, is easy to set up, and has an intuitive Web UI. However, the UI still has work to be done (autocomplete being one of the most requested features), and a desktop app would be a great future addition.

A lot of integrations are available, as you would expect from a solid product. One could argue that some of its features have driven changes in competitors (see Redshift RA3 nodes for the separation of compute and storage).

Paying for what you use is often seen as an advantage, but keep in mind that if you need a data warehouse that is always running, competitors like Redshift can offer heavily discounted prices. Also, in Snowflake, if you need a non-standard VW for specific workloads, the cost will increase substantially.

Snowflake offers different packages, each with a different cost model; the more expensive ones provide higher security levels. On one hand, you won't ever have to worry about security; on the other hand, you may be forced to buy the most expensive package if your business requires it. In other warehouse solutions you may need to spend a little more time configuring security, but you won't be bound so tightly to the pricing model.

The data lake offering is tempting, as it allows you to merge your data lake and your data warehouse into a single product. The advantages are so evident that competitors are aiming in the same direction (see Databricks' Lakehouse). Snowflake as a data lake is definitely an option you want to consider, especially if you are already using it as a data warehouse. It is becoming increasingly popular, but you need to make sure your use cases fit.
