What’s fueling the triple-digit growth of the Snowflake data platform?

Farhan Siddiqui
5 min read · Feb 9, 2022


This blog post highlights the technological underpinnings behind Snowflake's popularity, which have fueled triple-digit revenue growth since its IPO in 2020.

Data lakes aim to provide ready access to all data (structured, semi-structured, and unstructured) for analytics at any scale. Snowflake — a cloud-native, managed analytics database — is uniquely qualified to act as a data platform for structured and semi-structured data in a data lake.

Figure 1: Data platform recommendations for a data lake

Technological Underpinnings

The key technological underpinnings that make Snowflake suitable as a data platform for a data lake include:

1. Lower TCO and virtually unlimited scale achieved by separating storage and compute

Snowflake uses object stores like S3 to provide cheap and virtually unlimited storage at a fraction of the cost of typical NAS or SAN solutions used in commercial data warehouse appliances.

On the compute side, Snowflake uses on-demand, cloud-hosted virtual servers, appropriately sized for the analysis at hand, that pop in and out of existence in minutes. Snowflake has an “auto suspend” feature that shuts down a warehouse after a specified interval of inactivity and an “auto resume” feature that restarts the warehouse on the next query.

Given that compute costs account for over 80% of the spend in a typical cloud-native, managed analytics database, features like “auto suspend” and “auto resume” are critical to minimizing Total Cost of Ownership (TCO).
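As an illustration, here is how these settings are expressed in Snowflake SQL (the warehouse name and size below are placeholders):

```sql
-- Illustrative warehouse; name and size are placeholders.
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 60     -- suspend after 60 seconds of inactivity
  AUTO_RESUME    = TRUE;  -- restart automatically on the next query

-- Settings can be changed later without downtime.
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 300;
```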

Decoupling storage and compute allows Snowflake to bring the right amount of on-demand compute to bear on any-scale data that needs to be analyzed. This significantly reduces the overall cost of data analytics solutions.

Figure 2: Store everything you care about, query what you need, and pay for what you use

2. Blazing fast performance achieved using an architecture that minimizes disk IO

The speed of data analysis is directly dependent on the speed of computer input-output (IO) operations. IO operations typically involve memory, network, and disk, with memory IO being the fastest and disk IO being the slowest. Below are some of the latency numbers for computers built circa 2020:

Figure 3: Read latency for 1 GB of data

Snowflake uses columnar storage, coupled with partitioning and clustering, to significantly prune disk IO, thereby improving performance (a small example follows Figure 4).

Figure 4: IO pruning using columnar storage with partitioning and clustering
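As a sketch (table and column names are hypothetical), a clustering key is declared at table creation, and queries that filter on it allow Snowflake to skip micro-partitions whose metadata shows no matching rows:

```sql
-- Hypothetical fact table clustered on the columns most often filtered.
CREATE TABLE sales (
  sale_date DATE,
  region    VARCHAR,
  amount    NUMBER(12,2)
) CLUSTER BY (sale_date, region);

-- Filtering on the clustering key lets Snowflake prune micro-partitions,
-- reading only the columns and partitions the query actually needs.
SELECT region, SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2021-01-01' AND '2021-01-31'
GROUP BY region;
```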

Snowflake’s cloud services layer also helps eliminate IO by providing metadata and cached resultsets (when available) without having to query the underlying database.
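For example, result reuse is controlled by a session parameter; rerunning an identical query against unchanged data is answered from the cloud services layer without consuming warehouse compute (the table name continues the hypothetical example above):

```sql
-- Result reuse is on by default; it can be toggled per session.
ALTER SESSION SET USE_CACHED_RESULT = TRUE;

-- Run this query twice in a row against unchanged data:
-- the second execution is served from the result cache.
SELECT region, COUNT(*) FROM sales GROUP BY region;
```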

Figure 5: Snowflake architecture

3. New and innovative features

Snowflake enables many new and innovative features like time travel, zero-copy cloning, cross-account data sharing, and multi-cluster warehouses by leveraging the underlying cloud architecture to its fullest.

Time Travel: Snowflake’s underlying storage platforms (object stores like S3), though cheap and scalable, treat written objects as immutable. Snowflake builds on this write-once property to enable “time travel,” i.e., the ability to query the state of a database as it existed at a specific point in time.
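A quick illustration (the table name is hypothetical, and the retention window depends on your Snowflake edition and settings):

```sql
-- Query a table as it existed one hour ago (offset in seconds)...
SELECT * FROM orders AT(OFFSET => -3600);

-- ...or as of a specific timestamp.
SELECT * FROM orders AT(TIMESTAMP => '2022-02-01 12:00:00'::TIMESTAMP_LTZ);

-- Even recover an accidentally dropped table within the retention window.
UNDROP TABLE orders;
```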

Zero-copy cloning: Zero-copy cloning leverages the same immutability to point a new database at a “snapshot in time” of an existing database. The clone is available immediately and incurs no additional storage cost up front, because the only thing created is a new set of metadata pointers to a specific snapshot (only data that subsequently changes is stored separately).
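A minimal sketch, with hypothetical database and table names:

```sql
-- Clone production into a dev environment; no data is copied, only
-- metadata pointers to existing immutable micro-partitions.
CREATE DATABASE dev_db CLONE prod_db;

-- Cloning composes with time travel: clone a table as of 24 hours ago.
CREATE TABLE orders_backup CLONE orders AT(OFFSET => -86400);
```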

Cross-account data sharing: The separation of storage and compute makes it trivial to share data across accounts, both within the enterprise and with vendor partners.

This capability allows vendor partners to provide value-added data enrichment at a much lower cost than in-house processing by enterprises, as vendor partners can spread the cost of enrichment across many customers. One prime example of such value addition in the pharmaceutical industry is the prescription data enrichment provided by vendors like IQVIA.
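A sketch of both sides of a share (all account, database, and table names are illustrative):

```sql
-- Provider side: create a share and grant read access to specific objects.
CREATE SHARE rx_enriched_share;
GRANT USAGE ON DATABASE enriched_db TO SHARE rx_enriched_share;
GRANT USAGE ON SCHEMA enriched_db.public TO SHARE rx_enriched_share;
GRANT SELECT ON TABLE enriched_db.public.rx_claims TO SHARE rx_enriched_share;
ALTER SHARE rx_enriched_share ADD ACCOUNTS = consumer_account;

-- Consumer side: mount the share as a read-only database. No data moves;
-- the consumer's own warehouses query the provider's storage directly.
CREATE DATABASE rx_claims FROM SHARE provider_account.rx_enriched_share;
```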

Multi-cluster warehouses: Concurrency problems are common in distributed columnar databases: the node hosting the most popular data eventually becomes a bottleneck as the number of concurrent queries grows. Snowflake sidesteps this problem by allowing many warehouses to stream source data from the same underlying object store, spreading the concurrency load across multiple warehouses (see the sketch after Figure 6).

Figure 6: Multi-cluster warehouse
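For illustration, a multi-cluster warehouse (available on Snowflake’s Enterprise edition and above) simply declares minimum and maximum cluster counts; Snowflake adds clusters as concurrent queries queue up and retires them as load subsides:

```sql
-- Hypothetical BI warehouse that scales out from 1 to 4 clusters.
CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE    = 'LARGE'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'STANDARD';  -- favor starting clusters over queuing
```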

4. Bridging the analyst skill gap

As datasets continue to grow in size, it has become challenging to hire data analysts who have the right technical expertise to deal with them. Many languages and frameworks help process big data, including Hadoop, MapReduce, Pig Latin, Spark, Drill, Hive, Impala, Scala, SAS, R, and Python. However, Structured Query Language (SQL) remains the baseline skill across the data analyst community.

Snowflake allows an average data analyst with decent SQL knowledge to process big data with ease.
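This extends to semi-structured data as well: for instance, the VARIANT type lets an analyst query raw JSON with plain SQL (table and field names below are hypothetical):

```sql
-- Hypothetical table holding raw JSON events in a VARIANT column.
CREATE TABLE events (payload VARIANT);

-- Colon notation reaches into the JSON; ::type casts the result.
SELECT payload:user.id::STRING    AS user_id,
       payload:event_type::STRING AS event_type,
       COUNT(*)                   AS event_count
FROM events
GROUP BY 1, 2;
```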

Other Considerations:

Cost: The ease of scaling Snowflake as a data platform comes with the hidden risk of unbounded cost growth in the absence of effective governance and periodic pruning of unnecessary databases and warehouses.

Snowflake is better suited to intermittent workloads, while reserved-instance offerings like AWS Redshift RA3 have a more favorable cost profile for continuous workloads.

Security: Snowflake is a cloud-hosted data platform. By its nature, it increases the attack surface for malicious access, since it is reachable from outside the corporate intranet.

Built-in capabilities like IP whitelisting and private connectivity (e.g., AWS PrivateLink) can help mitigate these risks.
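For illustration, IP whitelisting takes the form of a network policy (the CIDR ranges below are placeholders):

```sql
-- Illustrative network policy restricting access to corporate IP ranges.
CREATE NETWORK POLICY corp_only
  ALLOWED_IP_LIST = ('203.0.113.0/24', '198.51.100.0/24');

-- Apply it account-wide (requires the appropriate privileges).
ALTER ACCOUNT SET NETWORK_POLICY = 'CORP_ONLY';
```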

Competitors: AWS Redshift-RA3, Google BigQuery, and Azure Synapse Analytics are some of the other cloud-native, managed analytics databases that may be worth exploring for your use case.

Image credits: Figure 5 & 6 — Snowflake documentation

Disclaimer: This is a personal blog post. The opinions expressed here represent my own and not my current or former employers. All content is provided for educational purposes only, without any warranty of suitability.


Farhan Siddiqui

Technology leader with over 20 years of experience delivering analytics solutions for Fortune 100 companies.