Selling the Data Lakehouse

(Photo: From 12 Creepy Abandoned Places You’ll Probably Find Ghosts Living)

Introduction

The recent craze around the data lakehouse seems to me like much ado about nothing. The idea is to have a single platform to support all your data and all of your workloads (Data Warehouse, BI, AI/ML, Streaming). Well, Snowflake has been able to support these scenarios for a long time now and has even moved well beyond them. It’s my opinion that most (or all) of the hype is by competitors that are trying to catch up to Snowflake.

In my previous blog post, Draining the Data Lake, I made an impassioned plea for everyone to stop using the term “data lake” and all its variations, like “data lakehouse.” And while I don’t advocate for keeping the term “data lakehouse,” it is interesting to look at what it represents and where it came from. In my research, I’ve learned that the term’s first use is believed to be by the Snowflake customer Jellyvision Lab in July 2017! Did you catch that? It was first used in relation to Snowflake. And what’s more, it was first used in relation to Snowflake about two full years before it got picked up and adopted by others!

In this blog post I will briefly discuss the key features of the data lakehouse, show that Snowflake has had those features for a long time, reveal how the term began (hint: the phrase was first used to describe Snowflake’s capabilities), and discuss what comes next.

So what is a data lakehouse?

While it seems that everyone is jumping on the “data lakehouse” bandwagon lately, the biggest push seems to be coming from Databricks. Early last year (at the end of January 2020) Databricks published the blog post What is a Lakehouse?, in which they tried to make the case that Databricks is the best platform to meet the requirements of the data lakehouse. But is that true? And why didn’t they mention Snowflake at all in their article? To answer those questions, we first need to understand what a data lakehouse is supposed to represent.

Here’s my summary/definition of the data lakehouse:

A data lakehouse is an architectural approach for managing all of your data (structured, semi-structured, or unstructured) and supporting all of your data workloads (Data Warehouse, BI, AI/ML, and Streaming).

But wait, hasn’t Snowflake had those features for a long time? The answer is yes. Let’s briefly run through each of those requirements and see why Snowflake is the original data lakehouse.

Managing all of your data

As I discussed in my previous blog post, Draining the Data Lake, file-based data lakes were created to address the shortcomings of traditional data warehouses. But file-based data lakes have resulted in yet another data silo and are difficult to manage. This is the exact problem the Snowflake founders set out to solve: combining the data warehouse and the data lake into a single platform. And that’s exactly what they’ve done. From the beginning, Snowflake has had native support for both structured and semi-structured data, eliminating the need for a file-based data lake.
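
To make that concrete, here is a minimal sketch of working with semi-structured JSON alongside SQL in Snowflake, using the snowflake-connector-python package. The connection values and the object names (raw_events, events_stage) are placeholders, not anything from this post; the pattern, though, is the familiar one: land the raw JSON in a VARIANT column and query nested attributes with path notation, no separate file-based lake required.

```python
# Hedged sketch: semi-structured JSON in a relational table via the
# Snowflake Python connector. All names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
cur = conn.cursor()

# A single VARIANT column holds the raw JSON; no upfront schema is needed.
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")

# Load JSON documents that were previously uploaded to an internal stage.
cur.execute("COPY INTO raw_events FROM @events_stage FILE_FORMAT = (TYPE = 'JSON')")

# Query nested attributes with path notation and cast them to SQL types.
cur.execute("""
    SELECT payload:user_id::STRING    AS user_id,
           payload:event_type::STRING AS event_type,
           COUNT(*)                   AS event_count
    FROM raw_events
    GROUP BY 1, 2
""")
for user_id, event_type, event_count in cur.fetchall():
    print(user_id, event_type, event_count)

cur.close()
conn.close()
```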

What about unstructured data like images, scanned documents, etc.? While many relational databases have long offered the ability to store unstructured data (in binary or BLOB data types), the reality is that it wasn’t cost-effective, or even possible in most cases, to store it all there. Nor was it practical to process that data from within the relational database. So the traditional answer has been to leave unstructured data in a file system and process it there with systems and services outside the database.

Snowflake already has strong support for working with unstructured data, and during the Data Cloud Summit back in November 2020 Snowflake announced new features to make that integration even better (see the “What’s Next in the Data Cloud” session with Christian Kleinerman starting at 36:02 and the Data Cloud Summit recap for more details). Even today, with native cloud storage integration and the External Functions feature, you can process any unstructured file with any process or service you want and use or store the results in Snowflake.
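
Here is a hedged sketch of that pattern with an external function. It assumes an API integration (doc_api_int), a remote endpoint (for example, an OCR or image-labeling service behind a gateway), and a table of document URLs (scanned_documents) already exist; every name and URL below is illustrative only, not something from this post.

```python
# Hedged sketch: processing unstructured files with an external service and
# keeping the results in Snowflake. All object names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# External function: Snowflake proxies each row to the remote service
# (e.g., an OCR endpoint). The API integration setup is omitted here.
cur.execute("""
    CREATE OR REPLACE EXTERNAL FUNCTION extract_text(file_url VARCHAR)
    RETURNS VARIANT
    API_INTEGRATION = doc_api_int
    AS 'https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/extract'
""")

# Apply it to a table of document URLs and store the results in Snowflake.
cur.execute("""
    CREATE OR REPLACE TABLE document_text AS
    SELECT file_url, extract_text(file_url) AS extracted
    FROM scanned_documents
""")

cur.close()
conn.close()
```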

Managing all of your data workloads

The goal of Snowflake, from the beginning, was to build a single platform that could support all of the data workloads in an organization. The three primary workloads highlighted by the data lakehouse definition are DW/BI, AI/ML, and Streaming. While much more could be said about each of these, here is a brief overview of how Snowflake supports each workload.

The Data Warehouse (DW) and related Business Intelligence (BI) workloads have been, and remain, the bread and butter of analytics. From the beginning, Snowflake has supported all of the key DW features, including governance via granular Role-Based Access Control (RBAC), a robust ANSI SQL implementation that supports all types of operations (DQL, DML, DDL), multi-statement transactions, ACID compliance, rich programmability features, and much more. But unlike other DWs, Snowflake was built from the ground up for the public cloud, which means it can truly take advantage of the scale the public cloud offers. For more details on Snowflake’s DW/BI capabilities, see the Snowflake as Your Data Warehouse landing page.
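
To illustrate two of those DW staples, here is a minimal sketch of granular RBAC grants and a multi-statement ACID transaction, issued through the Python connector. The role, schema, and table names (analyst_ro, analytics, accounts) are placeholders I made up for the example.

```python
# Hedged sketch: RBAC grants plus a multi-statement transaction in Snowflake.
# All names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# Role-based access control: a read-only analyst role scoped to one schema.
cur.execute("CREATE ROLE IF NOT EXISTS analyst_ro")
cur.execute("GRANT USAGE ON SCHEMA analytics TO ROLE analyst_ro")
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO ROLE analyst_ro")

# Multi-statement transaction: both DML statements commit or roll back together.
cur.execute("BEGIN")
try:
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")
    raise

cur.close()
conn.close()
```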

The second key workload to consider is AI/ML, which Snowflake has also supported for a while now. Study after study finds that data scientists spend around 80% of their time searching for and preparing data. Snowflake can eliminate most of this wasted effort by consolidating all of your data (in its native format) and giving the data scientist near-unlimited performance and scale, along with the ability to do feature engineering and transformations using SQL, Java, or Scala. All major AI/ML tools, libraries, and scripting languages work well with Snowflake thanks to its various connectors and drivers, including JDBC/ODBC, Python, R, and Spark connectors. What’s more, Snowflake has a rich partner ecosystem in the AI/ML space, including DataRobot, Dataiku, H2O.ai, Amazon SageMaker, Azure ML, and many others. And if that weren’t enough, Snowflake announced new features like Java/Scala and Python functions, which will allow AI/ML workloads to run directly inside Snowflake (see the “What’s Next in the Data Cloud” session with Christian Kleinerman starting at 14:40 and the Data Cloud Summit recap for more details). For more details on Snowflake’s AI/ML capabilities, see the Snowflake for Data Science landing page.
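
As a rough sketch of that workflow, the example below pushes feature engineering into Snowflake SQL and pulls the result into pandas for model training via the Python connector. The table and column names (customer_orders, churned) are hypothetical, and it assumes snowflake-connector-python is installed with the pandas extras.

```python
# Hedged sketch: feature engineering in Snowflake, model training locally.
# All object names and credentials are placeholders.
import snowflake.connector
from sklearn.linear_model import LogisticRegression

conn = snowflake.connector.connect(
    account="<account_identifier>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# Feature engineering runs where the data lives, at warehouse scale.
cur.execute("""
    SELECT customer_id,
           COUNT(*)         AS order_count,
           AVG(order_total) AS avg_order_total,
           DATEDIFF('day', MAX(order_date), CURRENT_DATE()) AS days_since_last_order,
           MAX(churned::INT) AS churned
    FROM customer_orders
    GROUP BY customer_id
""")
features = cur.fetch_pandas_all()  # column names come back uppercased

# Train locally, or hand the same frame to any AI/ML partner tool.
X = features[["ORDER_COUNT", "AVG_ORDER_TOTAL", "DAYS_SINCE_LAST_ORDER"]]
y = features["CHURNED"]
model = LogisticRegression().fit(X, y)
print(model.score(X, y))

cur.close()
conn.close()
```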

The third key workload to consider is Streaming. Probably the single most important question to ask whenever anyone mentions streaming data is “what specific latency do you have in mind?” Streaming can mean anything from millisecond latency to a few minutes to thirty-plus minutes. I’ve found that for most customers a streaming latency of around 5–10 minutes is acceptable. Snowflake has been able to support that latency for a long time now with its Continuous Data Pipelines (which include Snowpipe, the Snowflake Connector for Kafka, Tasks, and Streams), or with partner replication tools (like HVR, Replicate, Fivetran, et al.). In fact, Snowflake latencies today are easily half that range. There are certainly organizations with streaming requirements in the few-seconds-to-milliseconds range, but those solutions are often built with dedicated streaming tools.
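
Here is a hedged sketch of that Streams-and-Tasks pattern: a stream tracks new rows landing in a raw table (fed, for example, by Snowpipe or the Kafka connector), and a task merges them into a curated table every five minutes. The object names (raw_events, curated_events, transform_wh) are placeholders of my own, not from this post.

```python
# Hedged sketch: a continuous pipeline with a Stream and a scheduled Task.
# All object names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# Change-tracking stream over the landing table.
cur.execute("CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events")

# Task that wakes up on a schedule but only does work when the stream has data.
cur.execute("""
    CREATE OR REPLACE TASK load_curated_events
      WAREHOUSE = transform_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('raw_events_stream')
    AS
      INSERT INTO curated_events
      SELECT payload:event_id::STRING, payload:event_type::STRING, payload
      FROM raw_events_stream
""")

# Tasks are created suspended; resume to start the schedule.
cur.execute("ALTER TASK load_curated_events RESUME")

cur.close()
conn.close()
```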

The big reveal … it’s always been about Snowflake

On July 13, 2017, Jeremy Engle, then the Engineering Manager of the Data Team at Jellyvision, presented to an AWS user group in the Chicago area on “Strategies for supporting near real-time analytics, OLAP, and interactive data exploration” (his slides can be found here). He discussed their need for near real-time ingestion of semi-structured data and the challenges they had working with and getting value out of their data lake. He was looking for something that offered the benefits of both the traditional data warehouse and the data lake:

And the answer he gave on that July day in 2017 was Snowflake. But that’s not all; he also coined a new term to describe what the Snowflake platform already provided at that time. You’ve probably already guessed it: the term he coined was “data lakehouse”. Here’s the slide to prove it:

So the term “data lakehouse” was first used on July 13, 2017, to describe the Snowflake platform and its ability to support both data warehouse and data lake workloads.

The really big reveal … there’s so much more

As I hope I’ve demonstrated, Snowflake is the original “data lakehouse”. While I don’t like the term, because “data lake” has today become synonymous with a siloed, file-based data store (see Draining the Data Lake), it is interesting to point out that Snowflake has had these capabilities for a long time now! Here’s my summary/definition of the data lakehouse again:

A data lakehouse is an architectural approach for managing all of your data (structured, semi-structured, or unstructured) and supporting all of your data workloads (Data Warehouse, BI, AI/ML, and Streaming).

But even when competitors do catch up to Snowflake in these areas, there is still so much more that Snowflake offers today that they will be missing! Snowflake is already much bigger than the current data lakehouse architectural vision, and while these are all good things, I would argue that organizations want more than this limited architectural approach. They want the Data Cloud.

The Snowflake Platform is delivered as a service with near-zero maintenance and spans all three major public cloud providers. It offers a single platform that supports all of your data and all of your business workloads with centralized security and governance. But there’s more: it also powers the Data Cloud, a global network that connects the world’s data. Customers can securely share data and code across cloud regions, and even across cloud providers, without having to create and maintain costly ETL jobs to transfer the data and keep it up-to-date. This is accomplished through the Data Marketplace, private Data Exchanges, or direct shares. The Data Cloud is here.

So use Snowflake and let’s sell that old data lakehouse! And one last time, please, please can we stop using the terms “data lake” and “data lakehouse” :)?

--

Jeremiah Hansen
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

I’m currently a Field CTO Principal Architect at Snowflake. Opinions expressed are solely my own and do not represent the views or opinions of my employer.