What Makes Snowflake So Powerful — It’s the Hybrid of Shared Disk and Shared Nothing Architecture

Arun Kaushik
6 min read · Feb 14, 2020


The IT industry has been undergoing a massive transformation for the last few years. Companies have been investing heavily in upgrading their technology stacks to reduce operational costs and offer better service to customers. Digital, cloud, and as-a-service are some of the buzzwords driving transformation projects across the industry.

Having been a market leader in cross-border money transfer for many years, my company is upgrading its technology stack by focusing on digital and cloud solutions, which we expect to deliver strong returns on the investment. This article focuses on the data warehouse we adopted when migrating from a legacy on-premises data warehouse to a cloud one: Snowflake.

What is Snowflake?

Snowflake is a cloud data warehouse offered as Software-as-a-Service (SaaS). It is faster, more flexible, and easier to use than legacy data warehouses like Netezza and Teradata. Being a SaaS offering, Snowflake frees its users from installing, configuring, or managing the following.

Hardware

Software

Operational Maintenance

Snowflake runs entirely on public cloud infrastructure; it cannot run on a private cloud. When deployed on AWS, it uses EC2 instances to provide compute capability, offered as virtual warehouses, and S3 for persistent storage of data.

Snowflake follows a hybrid of the shared disk and shared nothing architectures. Before diving into Snowflake's architecture, let's look at the different parallel hardware architectures used by database systems: shared memory, shared disk, and shared nothing.

Shared Memory:

In a shared memory architecture, all processors access and share the same RAM and disk, so every processor sees roughly the same memory and I/O performance.

Shared Memory Architecture
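To make this concrete, here is a toy Python sketch of shared memory parallelism; the worker count and table contents are illustrative assumptions, not any real database engine. Every worker thread scans the same in-memory table, so nothing has to be partitioned or copied between workers.

```python
import threading

# Toy illustration of shared memory parallelism: all workers read the SAME
# RAM-resident table. Table contents and worker count are arbitrary.
table = [{"id": i, "value": i * 2} for i in range(1_000_000)]

NUM_WORKERS = 4
partial_sums = [0] * NUM_WORKERS

def scan(worker_id):
    # Each worker scans an interleaved slice of the shared list in place;
    # no data is partitioned or copied between workers.
    partial_sums[worker_id] = sum(
        row["value"] for row in table[worker_id::NUM_WORKERS]
    )

threads = [threading.Thread(target=scan, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(partial_sums))  # the full aggregate over the shared table
```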

Shared Nothing:

Database systems with a shared nothing architecture are made up of a cluster of independent machines, or nodes, connected to each other over a high-speed network. Each node has its own RAM and disk. Database tables are spread over the machines in the cluster using horizontal data partitioning so that the processors can run independently of one another. This means each node stores only a portion of the whole dataset on its disk.

Whenever a shared nothing cluster receives a client SQL request, the query is sent to every node in the cluster, and each node executes it against the portion of data it stores. This parallel execution across all nodes results in faster data processing.

Shared Nothing Architecture

Databases using the shared nothing architecture assign each tuple to an individual machine, resulting in horizontal partitioning of tables across all machines of the cluster. Techniques like hash-based, range-based, or round-robin partitioning can be applied to distribute the tables across the cluster, as sketched below.
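The following minimal Python sketch illustrates these ideas; the node count, row format, and query are assumptions for illustration, not the implementation of any particular database. It shows hash-based, range-based, and round-robin assignment of rows to nodes, followed by the scatter-gather query execution described above.

```python
import itertools

NUM_NODES = 4
nodes = [[] for _ in range(NUM_NODES)]  # each inner list stands in for one node's disk

def hash_partition(key):
    """Hash-based: assign a row to a node by hashing its key."""
    return hash(key) % NUM_NODES

def range_partition(key, boundaries=(250, 500, 750)):
    """Range-based: assign a row to a node by comparing its key to boundaries."""
    for node, upper in enumerate(boundaries):
        if key < upper:
            return node
    return NUM_NODES - 1

_rr = itertools.count()
def round_robin_partition(_key):
    """Round-robin: deal rows out to nodes in turn, ignoring the key."""
    return next(_rr) % NUM_NODES

# Load rows using hash partitioning: each node ends up with ~1/4 of the table.
for key in range(1000):
    nodes[hash_partition(key)].append({"id": key, "value": key * 2})

def query_all_nodes(predicate):
    """Scatter-gather: every node scans only its own partition, and the
    partial results are merged. In a real cluster the scans run in parallel."""
    results = []
    for local_rows in nodes:
        results.extend(row for row in local_rows if predicate(row))
    return results

matches = query_all_nodes(lambda row: row["id"] % 100 == 0)
print(len(matches))  # 10 matching rows, gathered from across the partitions
```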

It is the responsibility of the DBA to choose the partitioning technique that reaps the maximum benefit from the shared nothing architecture; a poor or inappropriate choice will lead to unsatisfactory performance.

Shared nothing systems are highly scalable and suit applications that require a high degree of concurrency. If the business goal is to build a system like Google.com or Amazon.com, where millions of customers must be served concurrently, a shared nothing system is an ideal choice. Even so, shared nothing systems come with their own challenges and limitations.

Failure of even a single node in the cluster can change the overall behavior of the database. Because data is spread horizontally across all nodes and each node processes only the data allocated to it, the failure of a node can leave the results of data processing incomplete.

Shared Disk:

In a shared disk system, all the processors can access and share a common disk but cannot access each other's RAM.

Databases using shared disk systems need less intervention from the DBA because tables do not have to be partitioned across nodes to achieve parallelism. Only very large databases still need to partition their tables.

Shared Disk Architecture

Failure of a single node does not affect processing at the other nodes; they can still process the entire dataset stored on the shared disk.

Despite advantages such as low administration cost and immunity to single-node failure, shared disk systems have their own challenges and limitations. Lock table contention is the best known of these.

Each node of a shared disk system keeps cached local copies of the data residing on disk. A distributed lock table is used to manage these copies and keep the data consistent across nodes.

Writing data in a shared disk system

Let's say we have two nodes in our shared disk system, and the disk contains a record with primary key = 123 and value = 'Arun'. Both nodes hold cached local copies of this record in memory. If a client wants to update the value from 'Arun' to 'Kaushik', then, to keep the data consistent across nodes, the database must take a distributed lock on every node that might have cached this record. As the number of nodes in the cluster grows, taking that distributed lock becomes slower and causes contention on the lock table, which slows down writes.
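As a rough illustration, the Python sketch below simulates this write path with one lock per node; the two-node setup, the record, and the lock-per-node model are simplifying assumptions, not how any real lock manager works. The point is that the writer must acquire a lock for every node that may have cached the record, so the cost of each write grows with the size of the cluster.

```python
import threading

NUM_NODES = 2
node_locks = [threading.Lock() for _ in range(NUM_NODES)]   # one lock per node
node_caches = [{123: "Arun"} for _ in range(NUM_NODES)]     # each node's cached copy
shared_disk = {123: "Arun"}                                 # the common disk

def update_record(key, new_value):
    # The writer must lock EVERY node that may cache this record. With N
    # nodes this means N lock acquisitions per write, which is why lock
    # table contention grows as the cluster grows.
    for lock in node_locks:
        lock.acquire()
    try:
        shared_disk[key] = new_value        # write through to the shared disk
        for cache in node_caches:           # keep every cached copy consistent
            cache[key] = new_value
    finally:
        for lock in node_locks:
            lock.release()

update_record(123, "Kaushik")
print(shared_disk[123], node_caches[0][123], node_caches[1][123])
```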

Snowflake Architecture:

Snowflake's architecture is a hybrid of the shared disk and shared nothing architectures. It is divided into three layers:

Cloud Services

Query Processing

Database Storage

The separation of compute (query processing) from storage (database storage) is the beauty and power of Snowflake, and it is what makes Snowflake one of the most powerful data warehouses offered as a cloud service.

Snowflake Architecture

Cloud Services:

The cloud services layer is a collection of services that coordinate activities across Snowflake. These services tie together all the different components of Snowflake in order to process user requests, from login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider.

Among the services in this layer:

Authentication

Infrastructure management

Metadata management

Query parsing and optimization

Access control

Query Processing:

The query processing layer takes advantage of the shared nothing architecture; query execution takes place inside this layer. Queries are processed using virtual warehouses. Each virtual warehouse is an MPP (massively parallel processing) compute cluster composed of multiple compute nodes allocated by Snowflake from the cloud provider; on AWS, these nodes are EC2 instances.

Each virtual warehouse is an independent compute cluster and does not share compute resources with other virtual warehouses. That is why the performance of one virtual warehouse does not impact the performance of another.
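As a small hands-on sketch using the snowflake-connector-python package, the snippet below creates two independent virtual warehouses; the warehouse names, sizes, and connection parameters are placeholders, not recommended settings. Because each warehouse is its own compute cluster, a heavy job running on one does not slow down queries on the other.

```python
import snowflake.connector

# Placeholders: substitute your own credentials and account identifier.
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
)
cur = conn.cursor()

# Two independent MPP clusters: heavy loads on ETL_WH do not affect
# queries running on REPORTING_WH.
cur.execute("CREATE WAREHOUSE IF NOT EXISTS ETL_WH "
            "WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60")
cur.execute("CREATE WAREHOUSE IF NOT EXISTS REPORTING_WH "
            "WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60")

# Point this session at one warehouse and run a query on it.
cur.execute("USE WAREHOUSE REPORTING_WH")
cur.execute("SELECT CURRENT_WAREHOUSE()")
print(cur.fetchone())

cur.close()
conn.close()
```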

Database Storage:

When data is loaded into Snowflake, it is reorganized into Snowflake's internal optimized, compressed, columnar format and stored in cloud storage. This layer uses the shared disk architecture.

Snowflake manages all aspects of how this data is stored: the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects Snowflake stores are neither directly visible nor accessible by customers; they are accessible only through SQL queries run using Snowflake.
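Since the storage layer is reachable only through SQL, even bulk loading is done with SQL statements. The sketch below (the table, local file path, and CSV layout are illustrative assumptions) stages a file and loads it via the Python connector; once loaded, the data lives in Snowflake's compressed columnar format and can only be read back with queries.

```python
import snowflake.connector

# Placeholders: substitute your own credentials, warehouse, and database.
conn = snowflake.connector.connect(
    user="YOUR_USER", password="YOUR_PASSWORD", account="YOUR_ACCOUNT",
    warehouse="ETL_WH", database="DEMO_DB", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS customers (id INT, name STRING)")

# Upload a local CSV to the table's internal stage, then load it. Snowflake
# rewrites the rows into its compressed, columnar storage; the underlying
# files are never directly visible to us.
cur.execute("PUT file:///tmp/customers.csv @%customers")
cur.execute("COPY INTO customers FROM @%customers FILE_FORMAT = (TYPE = CSV)")

# The only way back to the data is a SQL query.
cur.execute("SELECT COUNT(*) FROM customers")
print(cur.fetchone())

cur.close()
conn.close()
```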

Though Snowflake is one of the best data warehousing solutions on the market, that does not mean it fits every business use case. Companies should do thorough research, comparing it with other cloud offerings such as Amazon Redshift and Cloudera Altus, before deciding to onboard Snowflake. Research parameters could include cost, query concurrency, and administration overhead. Before making our buying decision, we ran a proof of concept (POC) comparing Snowflake, Redshift, and Cloudera Altus, and found Snowflake to be the most suitable solution for our business use case.

