“Snowflake: Revolutionizing Data Warehousing with a Hybrid Shared Data Architecture”

Shreya Mewada · Published in AnalyticsHere · 5 min read · Jul 1, 2023

Snowflake’s hybrid shared data architecture revolutionizes data warehousing by decoupling compute from storage, so the two can scale independently. Its multi-cluster shared data architecture lets multiple compute clusters access the same data simultaneously while taking advantage of the elasticity and scalability of cloud-native infrastructure. The result is consistent data without replication or synchronization, along with strong performance and cost efficiency.

Let's dive into traditional data warehouse types:

Traditional Data Warehouse Types:

Shared Disk Architecture (SDA) is a design approach where multiple nodes in a distributed system share access to a common storage resource. In this architecture, all the nodes are connected to a shared disk subsystem, such as a Storage Area Network (SAN).

Advantages of Shared Disk Architecture:

  • Simplified data management: Since all nodes share the same data, there is no need to replicate or synchronize data across nodes.
  • Easy node addition: New nodes can be attached to the shared disk subsystem without redistributing data.

Disadvantages of Shared Disk Architecture:

  • Potential single point of failure: If the shared disk subsystem fails, the entire system may become unavailable.
  • Limited scalability: The shared disk becomes a bottleneck under heavy concurrent access, so the system cannot scale beyond a certain point.

Shared Nothing Architecture (SNA) is a partitioned approach to designing distributed systems.

  • Each node in the system has its own dedicated resources, including storage.
  • Nodes operate independently and do not share any resources or memory.
  • Data is partitioned and distributed across multiple nodes.
  • Each node is responsible for managing its portion of the data.

Advantages:

  • Data partitioning: Data is divided into subsets, and each node holds and manages a distinct subset.
  • Local control: Each node has autonomy over its data and resources and can operate independently.
  • Fault isolation: Failures in one node do not affect the availability of other nodes, since they operate independently.

Disadvantages of Shared Nothing Architecture (SNA):

  • Data redistribution: Adding or removing nodes requires repartitioning data, and queries that span partitions involve data movement between nodes.
  • As it scales out, administrative costs increase.

Snowflake’s Unique Architecture:

Snowflake follows a hybrid approach with decoupled computing and storage.

  • Compute and storage can scale independently in Snowflake’s architecture.
  • It utilizes a multi-cluster shared data architecture.
  • Multiple compute clusters can access the same shared data simultaneously.
  • Snowflake takes full advantage of cloud-native features like elasticity.
  • Compute resources can be scaled automatically based on workload demands.
  • This architecture ensures consistency and eliminates the need for data replication or synchronization across clusters.
  • Snowflake offers optimal performance and cost efficiency.
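
As a minimal sketch of this (the warehouse and table names below are hypothetical), two independently sized warehouses can work against the same table at the same time, with no per-cluster copy of the data:

    -- Two independent compute clusters, sized for different workloads
    CREATE WAREHOUSE etl_wh WITH WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
    CREATE WAREHOUSE bi_wh  WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60  AUTO_RESUME = TRUE;

    -- Both warehouses read and write the same shared table;
    -- no replication or synchronization between clusters is needed
    USE WAREHOUSE etl_wh;
    INSERT INTO sales SELECT * FROM staging_sales;

    USE WAREHOUSE bi_wh;
    SELECT region, SUM(amount) FROM sales GROUP BY region;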

Snowflake’s unique architecture consists of three key layers:

Database Storage:

1. Snowflake stores data in databases, which are logical groupings of objects within a Snowflake instance.

  • Objects primarily include tables (permanent, temporary, and transient) and views (standard and materialized).
  • These objects belong to one or more schemas.

2. Snowflake supports structured relational data (standard SQL data types) and semi-structured non-relational data (JSON, Parquet, Avro, ORC, XML) using the VARIANT data type (a short example appears at the end of this list).

3. Snowflake utilizes highly secure cloud storage for structured and semi-structured data.

  • Data is loaded into tables, and Snowflake converts it into an optimized columnar compressed format (proprietary to Snowflake).
  • This format enhances data access efficiency, resulting in faster workloads and lower compute and storage costs.
  • The data is also encrypted with strong AES-256 encryption.

4. Data is loaded into the cloud storage layer (Amazon S3, Azure Blob Storage, or a Google Cloud Storage bucket), depending on the cloud platform.

  • The storage and retrieval details are abstracted from the user, as Snowflake handles the overhead.

5. The compressed and secure data can only be accessed through SQL queries; there are no other means of access.

6. Data storage costs are calculated from the daily average amount of data stored in bytes, across both short-lived and long-lived tables. Data retained for the Time Travel feature, when enabled, also counts toward the storage cost.
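
As a hedged sketch of items 2 and 6 (the table and JSON payload here are invented), semi-structured data lands in a VARIANT column, is queried with path notation, and Time Travel can read the table as it existed in the recent past, within the configured retention period:

    -- A VARIANT column holds semi-structured data such as JSON
    CREATE TABLE raw_events (payload VARIANT);

    -- Load a JSON document (in practice this is usually done via COPY INTO from a stage)
    INSERT INTO raw_events
      SELECT PARSE_JSON('{"user": {"id": 42, "name": "Ada"}, "event": "login"}');

    -- Query nested fields with path notation and cast them to SQL types
    SELECT payload:user.name::STRING AS user_name,
           payload:event::STRING     AS event_type
    FROM raw_events;

    -- Time Travel: read the table as it existed one hour ago
    -- (data retained for Time Travel counts toward storage cost)
    SELECT * FROM raw_events AT (OFFSET => -3600);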

Query Processing:

Queries (such as select queries, join queries, data loading, stored procedures, etc.) are run at the compute layer.

Compute resources must be provisioned in Snowflake as a virtual warehouse (VWH) before any query can be run.

Every virtual warehouse (compute resource) accesses the same underlying data store, or data layer.

You can choose a virtual warehouse to match the workload at hand without conflicts or performance compromises. When creating one, you only specify a name and a size (the larger the size, the more computing power it has); Snowflake handles all provisioning and setup of the underlying compute resources (EC2 instances on AWS, Azure VMs on Azure).
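
As a rough illustration (the warehouse and table names are made up), creating, resizing, and using a warehouse each take a single statement, while Snowflake provisions the underlying cloud compute behind the scenes:

    -- Provision a virtual warehouse: just a name and a size
    CREATE WAREHOUSE reporting_wh
      WITH WAREHOUSE_SIZE = 'SMALL'
           AUTO_SUSPEND   = 60      -- suspend after 60 seconds of inactivity
           AUTO_RESUME    = TRUE;   -- resume automatically when a query arrives

    -- Scale up for a heavier workload; new queries pick up the larger size
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';

    -- Run queries against the shared data layer using this warehouse
    USE WAREHOUSE reporting_wh;
    SELECT COUNT(*) FROM sales;

Because the warehouse suspends itself when idle and resumes on demand, compute is only consumed while queries are actually running.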

Cloud Services:

The cloud services layer is often called the “brain of Snowflake”: it coordinates and manages the system as a whole.

Snowflake keeps this layer highly available (with redundancy and fault tolerance), and it runs independently of the compute and storage layers.

This layer is responsible for the following:

· Authentication and authorization (through the Web UI, connectors, SnowSQL, native connectors, etc.)

· User and session management

· Query compilation, optimization, and data caching

· Virtual warehouse management, coordination of data storage/updates, and transaction management

· Metadata management (one of the key activities).
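
A small, hedged illustration of this layer at work (object names are arbitrary): purely metadata-driven statements are answered by the cloud services layer and do not need a running virtual warehouse, and identical repeated queries can be served from its result cache:

    -- Served from metadata in the cloud services layer; no warehouse required
    SHOW TABLES IN SCHEMA analytics.public;
    DESCRIBE TABLE analytics.public.sales;

    -- Repeating an identical query within 24 hours can be answered from the
    -- result cache managed by the cloud services layer, consuming no compute
    SELECT region, SUM(amount) FROM analytics.public.sales GROUP BY region;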

Summary:
· In conclusion, Shared Disk Architecture simplifies data management but suffers from a potential single point of failure and limited scalability.

· Shared Nothing Architecture provides fault isolation but incurs higher administrative costs as it scales out.

· Snowflake’s unique architecture combines decoupled compute and storage, enabling independent scalability, consistency, and optimal performance.

· Its three layers (Database Storage, Query Processing, and Cloud Services) work together to store and process data efficiently, ensuring security and facilitating centralized management.
