Complementing an Existing Data Lake with Snowflake

A Data Lake is a centralized repository where you can store all of your structured, semi-structured, and unstructured data at any scale. You can store your data “as is”, without needing to structure it beforehand. Traditionally, the Data Lake has been kept separate from the Data Warehouse, which creates different data silos.

Snowflake’s platform was built from the ground up to support Data Lake and Data Warehouse workloads in addition to Data Engineering, Data Science, Data Apps, and Data Sharing. It can store incredibly large amounts of data at cloud storage costs (Volume); store data as-is in many different native formats, without needing to structure it first, with schema-on-read semantics (Variety); and ingest, process, and analyze data at high speed to meet shorter SLAs (Velocity).

Snowflake goes even further with data-sharing capabilities that include unstructured data, providing governed access to files, as explained in the previous blog entry Governed Access to Files and Data Sharing using Snowflake Unstructured Data Management.

There are situations where customers have already invested in a Data Lake and want to use Snowflake as a fast query layer with governed access to the data. In many of those cases, they also use Snowflake’s virtual warehouse compute to perform transformations, so the Data Lake data can be augmented and enriched to power analytics dashboards and ML models, or even written back to the Data Lake as curated data. We are going to walk through an example of this use case in this blog entry.

Snowflake offers External Tables, Materialized Views, and Data Lake Export, the features we will be demoing in this entry. Our goal is to aggregate and enrich the well-known Citibike example, familiar to many of you from the highly recommended virtual hands-on lab Getting Started with Snowflake — Zero to Snowflake. We have already shown in a previous blog entry, Building Better Machine Learning Models using Snowflake Data Cloud, how we can get better predictions from an ML model by adding more features.

In this example, we start with trips data that already exists in an external Data Lake in Parquet format. We create a Storage Integration pointing to the S3 location and define the IAM role to be used. On the AWS side, we allow the Snowflake storage integration to access our S3 location. We then create a stage and define an External Table that will be used to query the data located in our Data Lake.
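Below is a minimal sketch of that setup. The bucket, IAM role ARN, and object names (s3://my-datalake/citibike/, snowflake_datalake_role, datalake_int, citibike_stage, trips_ext) are illustrative assumptions, as are the Parquet column names; adapt them to your environment.

-- Storage integration pointing at the Data Lake bucket (all names are illustrative)
CREATE OR REPLACE STORAGE INTEGRATION datalake_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_datalake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-datalake/citibike/');

-- Returns the IAM user and external ID to add to the role's trust policy on the AWS side
DESC INTEGRATION datalake_int;

-- Stage over the Data Lake location, defaulting to Parquet
CREATE OR REPLACE STAGE citibike_stage
  URL = 's3://my-datalake/citibike/'
  STORAGE_INTEGRATION = datalake_int
  FILE_FORMAT = (TYPE = PARQUET);

-- External Table exposing the Parquet trips files as queryable columns
CREATE OR REPLACE EXTERNAL TABLE trips_ext (
  starttime    TIMESTAMP AS (VALUE:starttime::TIMESTAMP),
  tripduration INTEGER   AS (VALUE:tripduration::INTEGER)
)
LOCATION = @citibike_stage/trips/
FILE_FORMAT = (TYPE = PARQUET);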

As seen in the next figure, we can query the S3 data through the stage to verify the Parquet content, or we can simply query our External Table as if it were a normal table.
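For example, using the illustrative names above:

-- Raw staged files: $1 is the VARIANT column exposed for each Parquet record
SELECT metadata$filename, $1 FROM @citibike_stage/trips/ LIMIT 10;

-- The External Table behaves like a regular table
SELECT starttime, tripduration FROM trips_ext LIMIT 10;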

Our goal is to provide an aggregated view where we can see the number of trips per day, the weather conditions, and the number of COVID-19 cases for that day in the New York area (as Citibike is a New York service). Instead of querying the External Table every time, we create a Materialized View that aggregates the number of trips per day. This works for our example, but it is always good to check whether a regular or a Materialized View is the better choice.
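A sketch of that aggregation, assuming the trips_ext External Table from the earlier step (Materialized Views over External Tables require Enterprise Edition or higher):

-- Number of trips per day, precomputed over the Data Lake files
CREATE OR REPLACE MATERIALIZED VIEW trips_per_day_mv AS
SELECT TO_DATE(starttime) AS trip_date,
       COUNT(*)           AS num_trips
FROM trips_ext
GROUP BY TO_DATE(starttime);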

Our next step is to join our trips data with the weather conditions for each day. We have ingested 125M records in JSON format into Snowflake using the VARIANT data type.

No transformations were applied; we simply left the data in its original format. Now we can query it and flatten it. We create a new view based on our weather table and the Materialized View built over the Data Lake.
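A possible shape of those views, assuming the raw observations sit in a table weather_json with a single VARIANT column v; the JSON paths (v:time, v:main.temp) are assumptions based on a typical weather payload, and nested arrays could be expanded with LATERAL FLATTEN:

-- Daily weather conditions extracted from the raw JSON
CREATE OR REPLACE VIEW weather_daily_vw AS
SELECT TO_DATE(v:time::TIMESTAMP) AS obs_date,
       AVG(v:main.temp::FLOAT)    AS avg_temp
FROM weather_json
GROUP BY 1;

-- Trips per day joined with the weather for that day
CREATE OR REPLACE VIEW trips_weather_vw AS
SELECT t.trip_date, t.num_trips, w.avg_temp
FROM trips_per_day_mv t
LEFT JOIN weather_daily_vw w ON w.obs_date = t.trip_date;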

Snowflake Data Marketplace provides access to live, ready-to-query data, either from the public listings or from a Private Exchange with suppliers, partners, lines of business, etc. It removes the need for ETL jobs and APIs: all the data is ready to be queried. Here we are going to use the Starschema COVID-19 data set. We have immediate access to all that data without having to copy or move it into our Snowflake account.
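Once the share is mounted, it can be queried like any other database. The database, schema, table, and column names below (covid19_by_starschema.public.jhu_covid_19 and its columns) are assumptions and may differ depending on how the listing was named in your account:

SELECT "DATE", province_state, case_type, cases
FROM covid19_by_starschema.public.jhu_covid_19
WHERE country_region = 'United States'
LIMIT 10;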

We first create a view to filter the New York cases and join it with the previous view, so we now have our final result: the number of trips, the weather conditions, and the COVID-19 cases for each day.
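A sketch of that final step, reusing the illustrative names from above (the Marketplace column names remain assumptions):

-- Confirmed COVID-19 cases per day for New York
CREATE OR REPLACE VIEW ny_covid_vw AS
SELECT "DATE"     AS case_date,
       SUM(cases) AS covid_cases
FROM covid19_by_starschema.public.jhu_covid_19
WHERE province_state = 'New York'
  AND case_type = 'Confirmed'
GROUP BY "DATE";

-- Final result: trips, weather, and COVID-19 cases per day
CREATE OR REPLACE VIEW trips_weather_covid_vw AS
SELECT tw.trip_date, tw.num_trips, tw.avg_temp, c.covid_cases
FROM trips_weather_vw tw
LEFT JOIN ny_covid_vw c ON c.case_date = tw.trip_date;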

Now we can provide governed access to that view, so that, for example, our Data Science team can query it without having to access the external Data Lake, saving a lot of time preparing all those features from different data sets. We can also copy the output of the enriched view back to our Data Lake using the Data Lake Export feature.
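For instance, access could be granted to a (hypothetical) Data Science role, and the enriched view unloaded back to the same stage as Parquet:

-- Governed access: the role name is illustrative
GRANT SELECT ON VIEW trips_weather_covid_vw TO ROLE data_science_role;

-- Data Lake Export: write the enriched result back to the Data Lake as Parquet
COPY INTO @citibike_stage/enriched/
FROM (SELECT * FROM trips_weather_covid_vw)
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE;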

Now the enriched data is in our Data Lake, and we can also query it from there if needed.
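For example, reading the exported Parquet files back through the same stage:

SELECT $1 FROM @citibike_stage/enriched/ LIMIT 10;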

This was a simple example, but there are many more features that could be included, like Streams and Tasks to automate transformations, together with ANSI SQL, Stored Procedures, or Snowpark for more complex transformations, all without the need for an external system. You can use Snowflake's elasticity with instant warehouse resizing and provide instant copies with Zero-Copy Cloning. Data exploration can be accelerated with Snowsight, and of course you have all the Data Governance capabilities, like Dynamic Data Masking, Row Access Policies, Secure Data Sharing, Object Tagging, etc., that can be applied in this scenario, reducing operational complexity, creating robust data pipelines, and shortening the latency from when data is generated until insights are produced.
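As a small taste of that automation, a hypothetical Task (warehouse name and schedule are illustrative) could re-export the enriched view to the Data Lake every day:

-- Daily re-export of the enriched view, reusing the illustrative objects above
CREATE OR REPLACE TASK export_enriched_daily
  WAREHOUSE = transform_wh
  SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
  COPY INTO @citibike_stage/enriched/
  FROM (SELECT * FROM trips_weather_covid_vw)
  FILE_FORMAT = (TYPE = PARQUET)
  HEADER = TRUE
  OVERWRITE = TRUE;

-- Tasks are created suspended; resume to start the schedule
ALTER TASK export_enriched_daily RESUME;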

Enjoy!

Carlos.-
