The Snowflake Lakehouse?

Disclaimer: I am a Senior Solution Architect at Snowflake with 16 years of data strategy, architecture, and engineering experience. The views expressed here are mine alone and do not necessarily reflect the views of my current, former, or future employers. The source code used in this post can be found on GitHub.

Reviewing legacy data architecture

Before modern data platforms, Data Lakes and Data Warehouses were monolithic beasts that did not interoperate well. Organizations maintained various platforms, such as Hadoop for unstructured data, alongside more traditional databases for storing structured data. It became (and still is for many) arduous to move data, manage security, upgrade systems and their dependent code, and keep the lights on to keep data flowing. This siloed architecture adds technical debt, is challenging to manage and scale, and does not deliver on the promised long-term benefits. The reference architecture below depicts the silos created by these disparate data stores.

Legacy Data Architecture

Journey to the Cloud

Snowflake pioneered the way for a cloud-native data stack by initially supporting the Data Warehousing use case. This modern approach blew the doors off on-prem solutions that could not keep pace with the volume of data coming at them. The near-infinite compute and storage of the cloud, plus the pay-as-you-use model, significantly lowered costs and reduced the time to analyze massive amounts of data. In addition, the cloud provides an always-connected, continuously updated platform with no upgrades or patching required, which improves organizations’ security posture by removing the risks of aging hardware and software. As Snowflake matured, it continued to evolve with capabilities to seamlessly load, share, and process data with additional connectors and supported languages. Today, Snowflake provides a robust platform capable of processing secured and governed data for both Data Warehouse and Data Lake use cases.

Is Snowflake really a Lakehouse?

Yes, and a whole lot more. The lakehouse architecture supports structured, semi-structured, and unstructured data in a common platform, with unified governance, security, and control. Snowflake has supported structured and semi-structured data (JSON, Avro, ORC, Parquet, XML) since inception, and it now also supports unstructured data. The difference between Snowflake and other cloud data platforms: it just works.

As a developer/engineer, this means less time working with infrastructure and more time working on wrangling data or fitting models. As a DBA, this means focusing on higher-value activities like securing and governing data rather than applying patches and managing indexes. As a business user, this means quicker time to insight both in terms of delivering analytic products and having a stable platform to efficiently load and transform data at scale.

Snowflake is a lot more than a combined Data Lake + Data Warehouse. It’s a Cloud Data Platform that supports Application Developers with Streamlit, Data Scientists with Python, and Business and Analytics users with access to third-party or shared data in real time. All serverless. No infrastructure to manage.

Modern Cloud Data Architecture, which supports Lakehouse

Realizing this Vision on Snowflake

How, you ask? Okay — so onto the fun part. Let’s see this in practice.

To be clear, this “demo” shows how easily you can use the Snowflake Data Cloud to load, transform, and analyze any data within a single platform. I do not get into performance or cost, as those depend on your specific data and use case.

But first, some background on this demo.

An organization’s legacy invoice management system generates a PDF file for every invoice and sends it to the customer. These files are then stored locally for record keeping. In addition, you maintain a CRM with your customer data that needs to be combined with your invoice data for further analysis. Right now, analyzing the CRM data together with the invoice data is a complicated and time-consuming process. We will be using Snowflake’s Data Platform to combine and analyze this data. The solution is broken into three parts:

  • Loading
  • Transformation/Data Processing
  • Data Analysis

Loading

We are given local directories of Parquet and PDF files. The Parquet files are for historical customer data. The PDFs are for invoices that need to be processed. We will perform a bulk load of this data into Snowflake from a local repository. Note: there are numerous ways to ingest data — from external locations (AWS, Azure, GCP) and near-real-time processing with Snowpipe.

Let’s look at our data — Customer Parquet data files (~100k records @ ~13MB compressed) and Invoice PDF files (~10k files at 1GB uncompressed). Note that multiple line items exist for each Invoice.

Sample Customer and Invoice Data (Parquet and PDF format)

Using SnowSQL, we can load this data into Snowflake internal stages with a few lines of code.
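A rough sketch of that load, assuming the two internal stages are created with directory tables enabled (the local paths below are illustrative, not the exact commands from the repo):

--create internal stages (DIRECTORY enabled so the staged files can be listed and served later)
CREATE OR REPLACE STAGE PARQUET_DATA_STAGE DIRECTORY = (ENABLE = TRUE);
CREATE OR REPLACE STAGE PDF_FILE_STAGE DIRECTORY = (ENABLE = TRUE);

--bulk upload the local files from SnowSQL (illustrative paths; AUTO_COMPRESS=FALSE keeps the PDFs stored as-is)
PUT file:///tmp/customer_data/*.parquet @PARQUET_DATA_STAGE AUTO_COMPRESS=FALSE;
PUT file:///tmp/invoices/*.pdf @PDF_FILE_STAGE AUTO_COMPRESS=FALSE;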

--let's review our objects and take a quick look at the stages
LIST @PARQUET_DATA_STAGE;
LIST @PDF_FILE_STAGE;

The directory structure mirrors what was on the local disk, but the files are in Snowflake’s internal stages.

Transformation and Data Processing

Now that the files are loaded into Snowflake, we need to transform the data so users can query it for analysis. The Parquet files can be read using Snowflake’s native semi-structured functions, but the PDF files require additional libraries for file processing. Snowflake supports Java, Scala, and Python (in Public Preview) to extend the platform’s programmability.

Java has an open-source library, Apache PDFBox, for extracting text from PDF files, which we can use inside a Java User Defined Function (UDF). Here is the code to create the UDF. We also need a new stage to hold the jar file referenced in the UDF.
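A minimal sketch of what that looks like (the jar version, stage name, and handler details here are illustrative; see the GitHub repo for the exact code):

--stage to hold the PDFBox jar referenced by the UDF
CREATE OR REPLACE STAGE JARS_STAGE;

--Java UDF that opens a staged PDF and returns its text content
CREATE OR REPLACE FUNCTION PROCESS_PDF(FILE_PATH STRING)
RETURNS STRING
LANGUAGE JAVA
IMPORTS = ('@JARS_STAGE/pdfbox-app-2.0.27.jar')  --illustrative jar name/version
HANDLER = 'PdfParser.readFile'
AS
$$
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import com.snowflake.snowpark_java.types.SnowflakeFile;
import java.io.IOException;

public class PdfParser {
    // open the staged file and extract its text with PDFBox;
    // depending on your setup you may need to pass a scoped or file URL
    // (e.g., BUILD_SCOPED_FILE_URL) instead of a raw stage path
    public static String readFile(String filePath) throws IOException {
        SnowflakeFile file = SnowflakeFile.newInstance(filePath, false);
        try (PDDocument document = PDDocument.load(file.getInputStream())) {
            return new PDFTextStripper().getText(document);
        }
    }
}
$$;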

Let’s process a single PDF file using the UDF and view the text output.

select process_pdf('@pdf_file_stage/INVOICE_NO_INV-0K6IS-5120.pdf');
Text output of a processed PDF file

Now we can process all the files and save them to a table. Because the process is memory intensive, we can scale our Virtual Warehouse up for additional computing power. This one-line ALTER command instantly scales up from 1 node to 32.
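Roughly, those two steps look like this (the warehouse and table names are placeholders; a 2X-Large warehouse runs 32 nodes versus the X-Small’s 1):

--scale the warehouse up for the heavy parsing, then back down afterwards
ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = XXLARGE;

--run the Java UDF over every file listed in the stage's directory table
CREATE OR REPLACE TABLE PARSED_PDFS AS
SELECT RELATIVE_PATH,
       PROCESS_PDF('@PDF_FILE_STAGE/' || RELATIVE_PATH) AS PARSED_TEXT
FROM DIRECTORY(@PDF_FILE_STAGE);

ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = XSMALL;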

That only took ~40 seconds to process 10,000 PDFs (~1GB of files)! The PDFs were converted to text using the Java UDF, and we can now parse them using Snowflake SQL functions.
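As an illustration, individual fields can then be pulled out of the text with standard string functions (the field labels and regular expressions below are assumptions about the invoice layout, using the hypothetical PARSED_PDFS table from the sketch above):

--extract fields from the raw PDF text with regular expressions
SELECT RELATIVE_PATH,
       REGEXP_SUBSTR(PARSED_TEXT, 'INVOICE NO[:#]?\\s*(\\S+)', 1, 1, 'e', 1) AS INVOICE_NO,
       REGEXP_SUBSTR(PARSED_TEXT, 'INVOICE DATE[:#]?\\s*(\\S+)', 1, 1, 'e', 1) AS INVOICE_DATE
FROM PARSED_PDFS;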

Let’s load the invoice data into structured tables and process the customer Parquet files from our stage. We will create three new tables: CUSTOMER, INVOICES, and INVOICE_LINES (one-to-many Customer to Invoices, one-to-many Invoice to Invoice Lines). And because the customer data is stored in Parquet format, we can take advantage of features that enhance the data lake use case: schema detection lets us infer the schema from the files before loading them, and we can then load from the stage into the newly created table without defining columns, protecting us from schema evolution and time-consuming data engineering tasks.
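The Parquet side of that, sketched with an illustrative file format name:

--detect the schema of the staged Parquet files and create the table from it
CREATE OR REPLACE FILE FORMAT PARQUET_FF TYPE = PARQUET;

CREATE OR REPLACE TABLE CUSTOMER
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(LOCATION => '@PARQUET_DATA_STAGE', FILE_FORMAT => 'PARQUET_FF'))
  );

--load without listing columns; names are matched automatically
COPY INTO CUSTOMER
  FROM @PARQUET_DATA_STAGE
  FILE_FORMAT = (FORMAT_NAME = 'PARQUET_FF')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;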

Let’s validate that the data was successfully loaded for our three tables.

SYSADMIN#SNOW_DB.SNOW_SCHEMA@COMPUTE_WH>
SELECT COUNT(*) FROM INVOICES;
+----------+
| COUNT(*) |
|----------|
|    10443 |
+----------+
1 Row(s) produced. Time Elapsed: 0.089s
SYSADMIN#SNOW_DB.SNOW_SCHEMA@COMPUTE_WH>
SELECT COUNT(*) FROM INVOICE_LINES;
+----------+
| COUNT(*) |
|----------|
|    31329 |
+----------+
1 Row(s) produced. Time Elapsed: 0.117s
SYSADMIN#SNOW_DB.SNOW_SCHEMA@COMPUTE_WH>
SELECT COUNT(*) FROM CUSTOMER;
+----------+
| COUNT(*) |
|----------|
|    99999 |
+----------+
1 Row(s) produced. Time Elapsed: 0.186s

Awesome! With our data loaded and transformed, let’s see how we can use this data to gain business insights.

Data Analysis

We will use Streamlit (acquired by Snowflake) to show the power of Snowflake with three different use cases built on the data we just transformed.

  • Customer Invoice Lookup — Browse Customer Invoice PDF files served from Snowflake.
  • COVID19 Sales Map — An overlay of COVID19 shared data with sales data.
  • Predictive Sales Forecast — A predictive model built with Python’s statsmodels library and executed from Snowpark.

Below are the applications! Note that I don’t show the source code for all the Streamlit apps, but you can access them from my GitHub repo.

Customer Invoice Lookup — This page allows users to search for an invoice based on a customer name and, subsequently, the invoice number. The resulting PDF is served directly from Snowflake’s internal stage using a directory table and providing access to the file URLs. There are three methods for providing file access, which make it easy to consume unstructured data in a secure and governed manner.
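For example, a time-limited URL for a given invoice can be generated straight from the stage’s directory table (a sketch; the 3600-second expiration is arbitrary):

--list the staged PDFs and generate a pre-signed URL the app can render
SELECT RELATIVE_PATH,
       GET_PRESIGNED_URL(@PDF_FILE_STAGE, RELATIVE_PATH, 3600) AS INVOICE_URL
FROM DIRECTORY(@PDF_FILE_STAGE)
WHERE RELATIVE_PATH ILIKE '%INV-0K6IS-5120%';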

COVID19 Sales Map — This app uses Snowflake’s direct data sharing. We will get data from Starschema on the Snowflake Marketplace and join it to our invoice data; the shared data set is updated automatically as new COVID data is published. You can get this data set with a few clicks and no ETL or data engineering. We can subscribe to the data set and validate that it has current COVID19 info.

COVID19 Data in Snowflake Marketplace
SYSADMIN#SNOW_DB.SNOW_SCHEMA@COMPUTE_WH> 
SELECT MAX(DATE), SUM(cases)
FROM COVID19_BY_STARSCHEMA_DM.PUBLIC.JHU_COVID_19
WHERE LAST_REPORTED_FLAG = TRUE
AND CASE_TYPE = 'Confirmed'
AND COUNTRY_REGION = 'United States';
+------------+------------+
| MAX(DATE)  | SUM(CASES) |
|------------+------------|
| 2022-07-25 |   90567138 |
+------------+------------+
1 Row(s) produced. Time Elapsed: 0.323s

Next, let’s create a view to help with our sales map.
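Something like the following (the customer and invoice column names and join keys are assumptions about the demo schema, not the exact code):

--combine last-30-day sales per state with the latest confirmed COVID19 case counts
CREATE OR REPLACE VIEW COVID_SALES_BY_STATE AS
WITH LATEST_COVID AS (
    SELECT PROVINCE_STATE AS STATE, SUM(CASES) AS CONFIRMED_CASES
    FROM COVID19_BY_STARSCHEMA_DM.PUBLIC.JHU_COVID_19
    WHERE COUNTRY_REGION = 'United States'
      AND CASE_TYPE = 'Confirmed'
      AND LAST_REPORTED_FLAG = TRUE
    GROUP BY PROVINCE_STATE
)
SELECT C.STATE,
       SUM(IL.ITEM_PRICE)      AS TOTAL_SALES,
       MAX(LC.CONFIRMED_CASES) AS CONFIRMED_CASES
FROM INVOICES I
JOIN INVOICE_LINES IL ON I.INVOICE_ID = IL.INVOICE_ID
JOIN CUSTOMER C       ON I.CUSTOMER_ID = C.CUSTOMER_ID
JOIN LATEST_COVID LC  ON LC.STATE = C.STATE
WHERE I.INVOICE_DATE >= DATEADD(DAY, -30, CURRENT_DATE())
GROUP BY C.STATE;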

We then select from the above view and plot the number of COVID19 cases as the diameter of a red circle per state. The blue columns represent the Item Price (Sales) data gathered from the invoice PDFs for each state over the last 30 days. Remember that this map will automatically pick up new COVID data with no additional processing.

COVID19 Sales Map using Streamlit’s PyDeck and Starschema COVID data

Predictive Sales Model (Run from Snowpark) — This application uses Snowpark to natively connect to a Python machine learning model that predicts forward-looking sales. Snowpark moves the complex computations to the cloud, utilizing the scale and elasticity of Snowflake Virtual Warehouses. I use the SARIMAX algorithm from Python’s statsmodels package to generate the model results for this demo. You can use the slider to select how many days of data you would like to predict. I am not a data scientist and don’t claim the model used makes good business sense — it was selected for demo purposes.

Updated Jan 2024:

The model above can still be created in Snowpark, but Snowflake now has generally available (GA) time-series forecasting that can easily perform this: https://docs.snowflake.com/en/user-guide/ml-powered-forecasting

You can use the code below.

--create a view to combine the invoice and invoice line data to obtain price by date 
CREATE OR REPLACE VIEW INVOICE_DATA AS
SELECT SUM(ITEM_PRICE) TOTAL_PRICE, TO_DATE(INVOICE_DATE)::TIMESTAMP INVOICE_DATE
FROM INVOICES I, INVOICE_LINES IL
WHERE I.INVOICE_ID = IL.INVOICE_ID
GROUP BY TO_DATE(INVOICE_DATE)::TIMESTAMP
ORDER BY INVOICE_DATE;

--create a new model with the data listed in the view. The two columns we need
--are the invoice date for the timestamp column and the price for the target column
CREATE SNOWFLAKE.ML.FORECAST SAMPLE_MODEL(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'INVOICE_DATA'),
TIMESTAMP_COLNAME => 'INVOICE_DATE',
TARGET_COLNAME => 'TOTAL_PRICE'
);


--call the sample model with the period. I selected 10, but this can be changed
--based on the number of days you would like to forecast; 10 generates 10 days forward
SYSADMIN#SNOW_SANDBOX.SNOW_SCHEMA@COMPUTE_XS_WH>
call SAMPLE_MODEL!FORECAST(FORECASTING_PERIODS => 10);
+-------------------------+------------------+------------------+------------------+
| TS                      | FORECAST         | LOWER_BOUND      | UPPER_BOUND      |
|-------------------------+------------------+------------------+------------------|
| 2022-07-18 00:00:00.000 | 185349628.460623 | 180653554.792768 | 190045702.128478 |
| 2022-07-19 00:00:00.000 | 186231818.000475 | 181212312.986578 | 191251323.014372 |
| 2022-07-20 00:00:00.000 | 187133874.483032 | 181810552.843759 | 192457196.122305 |
| 2022-07-21 00:00:00.000 | 186360678.372989 | 180749967.547633 | 191971389.198346 |
| 2022-07-22 00:00:00.000 | 186319514.747199 |  180435434.64438 | 192203594.850017 |
| 2022-07-23 00:00:00.000 | 188516426.860473 | 182371126.107295 |  194661727.61365 |
| 2022-07-24 00:00:00.000 |  186131383.82365 | 179735522.346287 | 192527245.301013 |
| 2022-07-25 00:00:00.000 | 185999505.123777 | 179362535.429087 | 192636474.818468 |
| 2022-07-26 00:00:00.000 | 186342313.814648 | 179472693.052273 | 193211934.577023 |
| 2022-07-27 00:00:00.000 | 185975581.781083 | 178880935.059119 | 193070228.503047 |
+-------------------------+------------------+------------------+------------------+
10 Row(s) produced. Time Elapsed: 7.497s

Conclusion

With legacy data architectures, organizations had to choose where to put their data, creating additional data silos and bottlenecks. This is simply no longer the case. As demonstrated, Snowflake is more than just a Data Warehouse, making it easy to ingest, transform, and analyze any data at scale.

Arguably, one of the best features since Snowflake’s inception is the separation of storage and compute. With the expansion into unstructured data, this becomes even more valuable, since you can effortlessly scale compute for unstructured data processing (Java, Scala, or Python) in the same manner you would for SQL workloads. In sum, Snowflake’s ease of use becomes even more valuable as Data Lake and application workloads are fully integrated into the platform.

And because the cloud enables automatic updates and newly released features (e.g., Unistore, Native Application Framework, Iceberg Tables), the value of shifting to the Snowflake Data Cloud keeps increasing, not just for today but throughout the evolving data landscape, providing organizations with a future-proof platform and a competitive advantage for years to come.
