Snowflake: Building a Data Cloud

Summary

Snowflake is a unified, comprehensive, global and highly available data cloud. Building a data cloud architecture capable of powering a diverse set of data and AI applications is an enormous technical challenge. This blog post presents the most important design elements in Snowflake’s architecture and is the first in a series of posts where we will share in more detail the technical challenges and novel solutions we’ve built to eliminate complexity for users and dramatically expand what they can do with their data.

Evolution of Data Analytics

Over the past two decades, the data analytics industry has been transformed by changes in how data is produced, stored and used to make business decisions. In particular, the data analysis problem has evolved from traditional data management on human-generated structured data to complex data and AI applications on machine-generated semi-structured and unstructured data. At the same time, the scale of both data and analytics workloads has increased dramatically.

With such a big transformation in the problem space, industry solutions have also evolved: data lakes, notebooks and streaming ecosystems have grown to handle more complex unstructured data as well as more complex data engineering and data science applications. Data warehouses have become more efficient and more powerful in dealing with larger volumes of data and workloads. Several domain-specific solutions have started to target use cases such as streaming and real-time processing, observability and telemetry, and geo data applications.

The proliferation of solutions has come with its own challenges: data siloed across systems, cost and complexity of managing data, lack of end-to-end data governance, and limitations in data sharing and collaboration.

Snowflake’s Data Cloud addresses these challenges by offering a unified, comprehensive, global and highly available cloud for data and AI applications. It eliminates silos, provides cohesive data governance, and enables collaboration across datasets and workloads that are traditionally served in isolation. Through that, Snowflake minimizes the time and knowledge needed to make data useful.

Data Complexity

The first step in building a Data Cloud is dealing with more complex data. In particular, we need to support efficient storage and processing of semi-structured and unstructured data as machine-generated and multi-modal data become a more integral part of data analytics and applications. Snowflake supports managing, storing, and processing unstructured data through directory tables and scoped URLs. Combined with features such as Document AI, this enables users to utilize unstructured data in the same framework as structured and semi-structured data.
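
As a hedged illustration of how this looks from the user's side, the sketch below uses the Snowpark Python API to create an internal stage with a directory table enabled and to generate scoped URLs for the files it tracks. The stage name and connection parameters are placeholders, not part of any real deployment.

```python
from snowflake.snowpark import Session

# Placeholder connection parameters -- substitute your own account details.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# Internal stage with a directory table enabled (stage name is illustrative).
session.sql("CREATE STAGE IF NOT EXISTS docs_stage DIRECTORY = (ENABLE = TRUE)").collect()

# List the files tracked by the directory table and build scoped URLs for them,
# so unstructured files can be referenced alongside structured data.
session.sql("""
    SELECT relative_path,
           BUILD_SCOPED_FILE_URL(@docs_stage, relative_path) AS scoped_url
    FROM DIRECTORY(@docs_stage)
""").show()
```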

Snowflake has supported efficient storage and processing of semi-structured data from day one as part of the query and storage engine design. This capability, which involves intelligent encoding, compression and shredding of semi-structured fields, is a key requirement for successfully handling semi-structured data. These storage and query processing features make Snowflake a great choice for efficiently storing and processing large-scale machine-generated data such as logs and events.
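
For example, here is a hedged sketch of the kind of query this enables (table and field names are invented for illustration): a VARIANT column stores raw JSON events, and path expressions plus LATERAL FLATTEN query nested fields and arrays directly.

```python
from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# A VARIANT column holds raw machine-generated JSON events (table name is illustrative).
session.sql("CREATE TABLE IF NOT EXISTS events (payload VARIANT)").collect()

# Path expressions and LATERAL FLATTEN read nested fields and arrays without
# any upfront schema definition.
session.sql("""
    SELECT payload:device.id::string AS device_id,
           e.value:code::int         AS error_code
    FROM events,
         LATERAL FLATTEN(input => payload:errors) e
    WHERE payload:ts::timestamp_ntz >= DATEADD('day', -1, CURRENT_TIMESTAMP())
""").show()
```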

In addition to semi-structured and unstructured data, Snowflake's data type framework enables native support for specialized data types such as geography and geometry. These data types are supported natively in the execution and storage engines through the same framework as semi-structured data, while providing domain-specific primitives such as geography and geometry operators and joins.
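
A small, hedged example of the native geospatial primitives this enables (coordinates, table and column names are illustrative):

```python
from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# GEOGRAPHY values are first class: distances come back in meters.
session.sql("""
    SELECT ST_DISTANCE(TO_GEOGRAPHY('POINT(-122.35 37.55)'),
                       TO_GEOGRAPHY('POINT(-122.40 37.60)')) AS meters
""").show()

# A proximity join between two illustrative tables with GEOGRAPHY columns.
session.sql("""
    SELECT s.store_id, c.customer_id
    FROM stores s
    JOIN customers c
      ON ST_DWITHIN(s.location, c.location, 5000)  -- within 5 km
""").show()
```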

Efficient ingestion and extraction of the various data types described above is another essential requirement for a Data Cloud. Like many modern data processing systems, Snowflake supports streaming data into and out of the storage system through Snowpipe Streaming and Snowflake Streams. But a critical part of dealing with data complexity is supporting transformations as the data is being ingested. Snowflake provides robust streaming transformation infrastructure through dynamic tables, which fully integrates with and leverages the Snowflake data flow engine used for query processing. Additionally, Snowflake's support for data lakes provides zero-copy ingestion and extraction through features like Iceberg tables. The combination of these two primitives enables efficient integration between modern data processing pipelines and Snowflake, making Snowflake an appropriate platform for both streaming data and open storage formats.
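
As a hedged sketch of such a streaming transformation (table and warehouse names are invented for illustration), a dynamic table declares the transformed result and a target lag, and Snowflake keeps it incrementally refreshed:

```python
from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# Declarative streaming transformation: the engine refreshes the result
# incrementally so it stays within the declared target lag.
session.sql("""
    CREATE OR REPLACE DYNAMIC TABLE customer_order_stats
      TARGET_LAG = '5 minutes'
      WAREHOUSE = transform_wh
    AS
      SELECT customer_id,
             COUNT(*)    AS order_count,
             SUM(amount) AS total_amount
      FROM raw_orders
      GROUP BY customer_id
""").collect()
```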

Another key design aspect of Snowflake is how it leverages existing cloud storage infrastructure such as blob storage services on AWS, Azure and GCP. The advantage of using such infrastructure as opposed to building a first-party storage system is two-fold:

  1. Scaling and offloading many aspects of data delivery and replication, and aligning them with other data applications that share the same datasets, gives the Data Cloud the flexibility to remain cost-efficient as data ingestion and extraction scale.
  2. This is a key building block for zero-copy data ingestion and extraction, enabling open and efficient integration with other data processing systems (see the sketch after this list).
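
As a hedged sketch of that zero-copy integration (the external volume name, base location and table definition are illustrative placeholders), a Snowflake-managed Iceberg table writes open-format data and metadata directly to customer-controlled object storage, where other engines can read it:

```python
from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# Iceberg table stored in open format on an external volume the customer owns;
# the volume name and base location below are placeholders.
session.sql("""
    CREATE ICEBERG TABLE events_iceberg (
      event_id BIGINT,
      event_ts TIMESTAMP_NTZ,
      payload  STRING
    )
    CATALOG = 'SNOWFLAKE'
    EXTERNAL_VOLUME = 'analytics_ext_vol'
    BASE_LOCATION = 'events_iceberg/'
""").collect()
```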

We apply the principle of storage layering to Hybrid Tables, where Snowflake decouples the transactional storage layer from the query processing layer in order to seamlessly support hybrid transactional and analytical operations such as bulk loading of data and replication across deployments.
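As a hedged sketch of what this looks like to a user (schema and index names are illustrative), a hybrid table is declared with a primary key and optional secondary indexes, then serves both point lookups and analytical scans through the same engine:

```python
from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# Hybrid table: row-oriented transactional storage with a required primary key
# and an optional secondary index (names are illustrative).
session.sql("""
    CREATE HYBRID TABLE IF NOT EXISTS orders (
      order_id    INT PRIMARY KEY,
      customer_id INT,
      status      STRING,
      amount      NUMBER(12, 2),
      INDEX idx_customer (customer_id)
    )
""").collect()

# The same table serves low-latency point lookups ...
session.sql("SELECT status FROM orders WHERE order_id = 1234").show()

# ... and analytical aggregations.
session.sql("SELECT status, COUNT(*), SUM(amount) FROM orders GROUP BY status").show()
```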

Snowflake supports governance, access and discovery of all types of data, models, apps and more through a unified framework called Snowflake Horizon. This governance framework is consistent across structured, semi-structured and unstructured data, making application development and collaboration simple and secure across all data use cases.
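
A hedged sketch of that unified governance surface (role, tag, table and column names are illustrative): the same tagging and masking-policy primitives apply regardless of what kind of data the object holds.

```python
from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# Classify data with a tag and protect it with a masking policy; the same
# primitives govern structured, semi-structured and unstructured objects.
session.sql("CREATE TAG IF NOT EXISTS pii_level").collect()

session.sql("""
    CREATE MASKING POLICY IF NOT EXISTS email_mask AS (val STRING) RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('PII_ANALYST') THEN val ELSE '***MASKED***' END
""").collect()

session.sql("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask").collect()
session.sql("ALTER TABLE customers SET TAG pii_level = 'high'").collect()
```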

Workload Complexity

In order to support a broad set of data applications, we must consider how to represent diverse workloads. Snowflake’s approach has been to pursue two directions in tandem.

Embrace and Expand SQL

While SQL has limitations, especially when it comes to representing procedural applications, it is a familiar declarative paradigm for expressing data flow operations. It also provides a familiar framework for data and object models and their management. Hence, Snowflake has been expanding both the SQL language constructs and the data model to enable data management and data flow for new types of workloads.

Examples of this include native semi-structured data types such as VARIANT, geospatial types such as GEOGRAPHY and GEOMETRY, streams and dynamic tables for incremental and streaming transformations, Iceberg tables for open storage formats, and hybrid tables for transactional workloads.

Going Beyond SQL

Applications today encompass a large set of tools, libraries, notebooks and scripts across many programming languages. It would be impossible for a platform to support such applications purely through SQL. Most SQL-based analytics systems today support extending the language interface through user-defined functions. However, capturing the fast-evolving world of applications requires first-class support for frameworks such as Apache Spark, Pandas and PyTorch.

Snowflake today offers Snowpark (including Snowpark Container Services) as well as Streamlit in Snowflake as programming frameworks that enable building applications for data science, AI and data engineering. All of these frameworks work in tandem with the SQL interface (data model and operations), and many applications seamlessly cross the SQL and non-SQL worlds. This is enabled by the unified execution engine discussed below.
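
For example, here is a hedged sketch of a Snowpark Python DataFrame pipeline (table and column names are invented); the DataFrame operations are pushed down and executed by the same engine that runs SQL, rather than in the client:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# DataFrame operations are translated and pushed down to the Snowflake engine,
# so this pipeline runs next to the data.
top_customers = (
    session.table("raw_orders")
           .filter(col("status") == "SHIPPED")
           .group_by("customer_id")
           .agg(sum_("amount").alias("total_amount"))
           .sort(col("total_amount").desc())
           .limit(10)
)
top_customers.show()
```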

Unified Processing

One risk of supporting a large ecosystem of programming paradigms and frameworks is a divergence of data processing capabilities and, worse, the siloing of data. From the beginning, Snowflake built a generalized, vectorized data flow processing engine capable of executing any data flow operation, not just SQL. This data flow model has since become widely popular through computing frameworks such as Apache Spark and Pandas.
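
To make "vectorized data flow" concrete, here is a toy sketch in plain Python (purely illustrative, not Snowflake internals): operators consume and produce batches of column vectors, and each operator works on whole vectors rather than one row at a time.

```python
# Toy sketch of vectorized, batch-at-a-time operators (illustrative only).
from dataclasses import dataclass


@dataclass
class Batch:
    columns: dict  # column name -> list of values (one vector per column)


def vectorized_filter(batch: Batch, column: str, predicate) -> Batch:
    # Evaluate the predicate over one column vector, then apply the
    # resulting selection mask to every column in the batch.
    mask = [predicate(v) for v in batch.columns[column]]
    return Batch({name: [v for v, keep in zip(vec, mask) if keep]
                  for name, vec in batch.columns.items()})


def vectorized_divide(batch: Batch, out: str, numerator: str, denominator: str) -> Batch:
    # A simple projection computed vector-at-a-time.
    result = [a / b for a, b in zip(batch.columns[numerator], batch.columns[denominator])]
    return Batch({**batch.columns, out: result})


batch = Batch({"amount": [10, 250, 40], "qty": [1, 5, 2]})
batch = vectorized_filter(batch, "amount", lambda v: v > 20)
batch = vectorized_divide(batch, "unit_price", "amount", "qty")
print(batch.columns)  # {'amount': [250, 40], 'qty': [5, 2], 'unit_price': [50.0, 20.0]}
```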

Snowflake’s compute engine serves all data processing operations, SQL and non-SQL, analytical and transactional. It even goes beyond user workloads and powers all background operations and optimizations on data, such as automatic clustering, dynamic tables, materialized views and search optimization. An important advantage of having a single, unified engine is that continuous improvements we make to query engine performance (captured by the Snowflake Performance Index) broadly elevate performance across all workloads. Additionally, this enables Snowflake to provide consistent insights into job execution and to support a coherent, easy-to-use compute resource management model.

Adaptive Execution

An important implication of supporting programming paradigms outside SQL is the introduction of black-box operations to the data flow execution engine. An example of such a black-box operator is a user-defined function written in programming languages such as Python or Java.
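
For instance, here is a hedged sketch of registering such a black-box operator through Snowpark Python (function, table and column names are illustrative): the optimizer sees only an opaque function call, not the Python body.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import FloatType

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# From the engine's point of view this is a black box: the per-row cost of the
# Python body is unknown to the optimizer ahead of time.
@udf(name="normalize_score", return_type=FloatType(), input_types=[FloatType()],
     replace=True, session=session)
def normalize_score(x: float) -> float:
    return max(0.0, min(1.0, x / 100.0))

session.table("model_outputs") \
       .select(col("id"), normalize_score(col("raw_score")).alias("score")) \
       .show()
```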

In some cases, such as ML pipelines, these functions can be extremely CPU- or memory-intensive, while in others they are trivial transformations; their overhead can also vary significantly from one row to another. Since the query optimizer has no visibility into the cost and overhead of these operators, it is difficult to optimize the query plan or bound query optimization time. Instead, it is critical that the execution layer adapt to the compute and memory overhead of such operators. Local and distributed operators in Snowflake are adaptive by design and can easily deal with characteristics such as processing skew.

Adaptive execution also plays an essential role in serving hybrid transactional and analytical workloads where the same data is represented in several different forms, such as row- and column-oriented formats as well as secondary indices. Static optimization becomes intractable in those cases, making adaptive execution essential. For example, Snowflake’s adaptive execution engine can decide at query runtime between using a secondary index and redistributing data for a join, by observing the execution pattern.
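
Here is a toy, purely illustrative sketch of that idea in plain Python (not Snowflake's planner or runtime): the operator observes the first batch it processes before committing to an index-lookup strategy or a hash-join strategy.

```python
# Toy sketch of a runtime-adaptive strategy choice (illustrative only).
def adaptive_join(probe_batches, index, build_rows, selectivity_threshold=0.05):
    probe_batches = iter(probe_batches)
    first = next(probe_batches)

    # Observe the first batch before picking a strategy.
    hits = sum(1 for key in first if key in index)
    selectivity = hits / max(1, len(first))
    all_batches = [first, *probe_batches]

    if selectivity <= selectivity_threshold:
        # Few probe keys match: point lookups against the secondary index win.
        strategy = "index_lookup"
        matches = [(k, index[k]) for batch in all_batches for k in batch if k in index]
    else:
        # Many probe keys match: build a hash table and stream the probe side.
        strategy = "hash_join"
        built = dict(build_rows)
        matches = [(k, built[k]) for batch in all_batches for k in batch if k in built]
    return strategy, matches


index = {2: "a", 5: "b"}
print(adaptive_join([[1, 2, 3], [4, 5, 6]], index, build_rows=index.items(),
                    selectivity_threshold=0.5))
# -> ('index_lookup', [(2, 'a'), (5, 'b')])
```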

Data Granularity

In modern data applications, data scale and granularity vary from workload to workload. Some use cases operate on small datasets (gigabytes), while others require processing petabytes of data. There are also workloads on very large datasets whose working sets are small.

A key design pattern to handle such a large disparity in scales is to micro-partition data and store it decoupled from the compute infrastructure. This pattern, which is now employed by most cloud-based analytics systems, allows independent scaling of compute and storage and provides cost efficiency at scale. However, this pattern is not sufficient to provide good performance across the spectrum.

What is needed is a mechanism to scale metadata operations with the data, so that for small-scale datasets, metadata operations are as cheap and fast as possible, while for large-scale datasets, metadata operations can be parallelized and scaled.

Snowflake employs four techniques to achieve this:

  1. Decoupling of physical metadata from data: Metadata needed for distributed query planning and optimization should be available without the need to access data (a simplified sketch appears after this list).
  2. Treating metadata as data: Laying out metadata in a columnar compressed structure that can be processed using the same query engine as the rest of the data; this is critical to scaling out the processing of metadata.
  3. Creating a hierarchy of metadata: Optimizing the physical metadata layout for large-scale processing sometimes leads to poor performance for small datasets where the overhead of processing metadata in a distributed fashion becomes visible. Creating a hierarchy of metadata allows for a graceful and gradual shift from single-node, low-overhead processing of metadata to large-scale distributed processing as the metadata size grows.
  4. Employing secondary data structures and indices: Snowflake employs additional data structures, such as Search Optimization indices, to provide fast and efficient access to data that is sometimes not possible through the main layout of metadata. In doing so, a key consideration is maintaining the ease of use and manageability of the platform.
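
As a concrete, deliberately simplified illustration of the first two techniques, the plain-Python sketch below keeps per-partition min/max metadata separate from the data itself, so a range filter can decide which micro-partitions to scan without touching any data files. Nothing here reflects Snowflake's actual metadata format; names and values are invented.

```python
# Toy sketch of pruning with data-decoupled metadata (illustrative only).
from dataclasses import dataclass


@dataclass
class PartitionMeta:
    file_id: str
    min_max: dict  # column -> (min, max) summary kept apart from the data files


def prune(partitions, column, lo, hi):
    """Return only the partitions whose [min, max] range overlaps [lo, hi]."""
    return [p for p in partitions
            if p.min_max[column][0] <= hi and p.min_max[column][1] >= lo]


partitions = [
    PartitionMeta("fdn_0001", {"event_ts": (100, 199)}),
    PartitionMeta("fdn_0002", {"event_ts": (200, 299)}),
    PartitionMeta("fdn_0003", {"event_ts": (300, 399)}),
]

# WHERE event_ts BETWEEN 250 AND 320 -> only two of the three partitions are scanned.
print([p.file_id for p in prune(partitions, "event_ts", 250, 320)])  # ['fdn_0002', 'fdn_0003']
```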

Workload Granularity

Another important dimension of supporting diverse applications is to consider how high-QPS (queries per second), high-concurrency operational workloads differ from large-scale analytical workloads and the implications of that on system architecture. We addressed many such challenges while building Snowflake Unistore.

Traditionally, analytical systems work with immutable storage through batch operations in data structures such as LSM trees, and the physical metadata is designed around that assumption. However, this model has an extremely high cost for high-TPS (transactions per second) operational workloads. For those cases, the system needs a mutable storage data structure, a log, and corresponding metadata to support cheap and fast data mutations. Once that storage is in place, a major problem is how to consolidate the two types of metadata.

Snowflake’s query engine works across the metadata models by:

  1. Establishing a common consistency and isolation semantic. This is the first step towards addressing the major challenge of providing transactional consistency across data and metadata operations spanning different storage layers such as analytical and hybrid tables.
  2. Leveraging the metadata hierarchy to represent both types of data. The hierarchy allows Snowflake to efficiently prune and optimize query plans while parts of the query address data that resides in different storage systems such as analytical and hybrid tables (see the sketch after this list).
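
A hedged sketch of what this looks like from the outside (table and column names are illustrative): a single statement can join a hybrid table with a standard analytical table, and the engine plans across both storage layers.

```python
from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# One statement spans both storage layers: 'orders' is a hybrid table serving
# point lookups, 'order_history' is a standard analytical table.
session.sql("""
    SELECT o.order_id, o.status, h.lifetime_value
    FROM orders AS o
    JOIN order_history AS h
      ON o.customer_id = h.customer_id
    WHERE o.order_id = 1234
""").show()
```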

In addition to supporting the two types of metadata and seamlessly working across them, the system needs to manage per-query overhead. Observability into hybrid analytical and transactional workloads is particularly challenging. Transactional systems often provide observability at the workload level, since the overhead of persisting such information per-query is not acceptable. Analytical systems, on the other hand, can significantly benefit from per-query observability as queries are much more diverse and the overhead is less pronounced. To solve this problem, Snowflake adaptively adjusts the level of insight into the workload depending on its static and runtime characteristics.

Finally, scheduling workloads at different levels of granularity is another major challenge for hybrid systems. Transactional workloads tend to be predictable and latency-sensitive; therefore, it makes sense to make scheduling and isolation decisions at the workload or session level. Analytical workloads, on the other hand, are unpredictable and diverse. However, they have less stringent latency requirements. Therefore, more granular scheduling and isolation decisions are the best approach. The hybrid system needs to work with both types of scheduling and isolation levels while managing a pool of cloud resources for maximum utilization.

Global Cloud Service

The Snowflake Data Cloud spans cloud providers and regions while maintaining high availability and durability to address the proliferation of mobile and machine-generated data. Here we highlight a few aspects necessary to achieve that.

Disaster Recovery

Supporting data storage and processing across all major cloud provider regions requires an efficient and reliable data replication and failover framework. This cross-cloud cross-region replication provides Snowflake customers with durability and disaster recovery guarantees beyond what is provided by cloud providers today. Additionally, in order to run Snowflake compute and storage across multiple cloud providers, Snowflake has built generic cloud-agnostic infrastructure while taking advantage of cloud-specific technologies in specific cases.
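
As a hedged sketch of how a customer opts into this (the group, database and account identifiers are placeholders, and the exact parameters should be checked against current documentation), a failover group replicates selected objects to another account and region on a schedule:

```python
from snowflake.snowpark import Session

# Placeholder connection parameters.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "PUBLIC",
}).create()

# Replicate a database to another account on a schedule; failing over later
# promotes the secondary in the target account to primary.
session.sql("""
    CREATE FAILOVER GROUP IF NOT EXISTS analytics_fg
      OBJECT_TYPES = DATABASES
      ALLOWED_DATABASES = analytics_db
      ALLOWED_ACCOUNTS = myorg.dr_account
      REPLICATION_SCHEDULE = '10 MINUTE'
""").collect()
```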

High Availability

In order to provide high availability and support critical customer use cases, Snowflake employs a range of techniques such as:

  • Production replay, generative, stress, and performance testing. An expansive testing infrastructure enables Snowflake to ensure the correctness and performance of all features being released through static or automatically generated test scenarios. Additionally, features take advantage of production replay testing to add another layer of safety before the rollout begins.
  • Decoupled and automated feature rollout. Snowflake’s release management infrastructure supports independent rollout of various features and provides automated monitoring and rollback mechanisms to ensure that new features do not impact the availability or performance of the service.
  • Instrumentation, logging, and monitoring that can scale across all regions and deployments. In order to achieve this, we use Snowflake itself to ingest, store and process instrumentation and telemetry data. This approach not only allows scaling to all regions and deployments, it also provides security, privacy and data residency controls through existing Snowflake data governance features.
  • Elastic, self-healing service. Snowflake cloud services automatically adapt to workload patterns across regions and deployments by adjusting the serving capacity and by isolating or redirecting workloads to minimize interference. Additionally, the serving layer takes corrective action to mitigate a variety of failures during query execution by transparently adjusting working parameters of the query and automatically re-executing it.

Conclusion

Snowflake provides a unified, comprehensive, global and highly available data cloud to power data and AI applications. Doing so requires overcoming key design challenges. In this post, we presented several storage and compute architecture choices we made and how they enable data cloud features. In future posts, we will discuss these architectural choices in more detail.
