Data lakehouse architecture beats traditional data warehouse & data lake

Anu · Published in data-surge · Jul 13, 2021

Over the last few years, the term “data lakehouse architecture” has started bubbling up in the big data scene. So, what is this architecture? How does it tie in with your ongoing data efforts of ingesting, cleansing, transforming, and processing data, building confidence in the data you deliver, and extracting actionable insights from it? All this while trying to keep up with changing data demands, whether in velocity, schema changes, or data requirements? Does it have answers to all your data troubles?

The term Big Data was first coined in 2005 by Roger Mougalas of O’Reilly Media. Ever since then, there has been a surge of hundreds of products and services (both open source and commercial) that try to offer respite from your data troubles. There is huge value in driving key business decisions by combining data and AI across virtually every industry vertical: Healthcare & Life Sciences, Financial Services, Retail & CPG, Media & Entertainment, and the Public Sector. Improved recommendations have driven conversions and produced $30MM in revenue in online retail; a media company achieved 90% cost savings by cutting its infrastructure costs through performance improvements; predictive ordering that reduces food spoilage has yielded $100MM in cost savings in brick-and-mortar retail; and faster fraud detection over massive data sets has seen $200MM in cost savings in financial services. [Source: Databricks Webinar]

The question is: how do you pick an architecture, with services, that fits your data needs? There is much confusion about the right tool for the right job, since multiple technologies offer similar features and each claims to be better than the others. Let’s talk about a few things you want to consider when thinking about data architectures:

  • Identify the critical capabilities that you need at a holistic level
  • Understand your data by profiling it across all your data sources to determine how you will ingest from each one — velocity of incoming data, frequency of data, veracity of data, consumers of the data, etc.
  • How will you combine operations, analytics and governance with the data ingestion and processing needs?
  • How big of a data infrastructure are you going to need to support the ever-growing needs of your data?
  • What are your data processing needs? Do you need real time or batch processing or a hybrid solution?
  • What does your data storage look like, including raw data, intermediate data, and final data? Storage of the data for fast compute is a critical factor — you should be able to store large amounts of data of any format and scale on an as-needed basis.
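As a concrete illustration of the profiling step above, here is a minimal sketch in plain Python (the field names and sample records are hypothetical) that summarizes a source’s schema coverage, null rates, and arrival velocity before you design its ingestion path:

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of records pulled from one data source.
records = [
    {"id": 1, "ts": "2021-07-01T10:00:00", "amount": 12.5},
    {"id": 2, "ts": "2021-07-01T10:00:30", "amount": None},
    {"id": 3, "ts": "2021-07-01T10:01:00", "amount": 7.0},
]

def profile(records, ts_field="ts"):
    """Summarize field presence, null rates, and arrival velocity."""
    field_counts = Counter()
    null_counts = Counter()
    for rec in records:
        for field, value in rec.items():
            field_counts[field] += 1
            if value is None:
                null_counts[field] += 1
    timestamps = sorted(
        datetime.fromisoformat(r[ts_field]) for r in records if r.get(ts_field)
    )
    span = (timestamps[-1] - timestamps[0]).total_seconds() if len(timestamps) > 1 else 0
    return {
        "fields": dict(field_counts),
        "null_rates": {f: null_counts[f] / field_counts[f] for f in field_counts},
        "records_per_second": len(timestamps) / span if span else None,
    }

summary = profile(records)
print(summary["null_rates"]["amount"])  # one of the three records has a null amount
```

A profile like this, run per source, tells you whether a source needs streaming or batch ingestion and how much cleansing the raw data will require.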

Critical drivers for modern architecture aren’t just data volume but also the variety and complexity of the data you own. It’s also about how many different users have data access and how you can break down the silos among them. A lot of companies believe they don’t need a modern architecture because their data is small and simple and no evolution is happening on their end right now. This is a fallacy: the evolution is going to happen, and it is going to happen quickly. The companies that understand this and invest in modern architectures from day one are the ones that succeed. You want the capability for today AND tomorrow, and you need to move fast to get to that end happy state.

And that brings us back to our data lakehouse architecture and how it can fit fairly wide and varied data use cases. It can provide the flexibility we need in today’s big data era.

Let me start with a bit of history —

Data Warehouses came into being to provide a carefully designed data structure that delivered fast data insights. They were purpose-built for BI, reporting, etc., and all their data management capabilities were catered very well to that notion. Of course, they had their own challenges: they were exclusive to structured data, SQL-only, offered no support for data science and ML, had limited support for streaming, and, more importantly, used closed, proprietary formats.

From there on, Hadoop came to life, and soon thereafter the concept of the Data Lake emerged, which helped store and process unstructured data. At this point, Data Warehouses were not able to keep up with rising data needs, and we saw many data warehouse solutions, including appliance-based ones, fail. The data lake’s philosophy was simple and enticing: dump all the data you need into one place for all your different use cases. What it didn’t account for was data quality enforcement, schema enforcement, and data integrity, and data lakes soon became data swamps due to the lack of structure and governance.

Then Data Warehouses and Data Lakes started to coexist, which was really expensive and made it hard to keep the two consistent. Many users with different use cases spanned both sides, which created silos, exactly what we want to get away from.

As the term “Data Lakehouse” suggests, it is an amalgamation of the Data Lake and the Data Warehouse. The data lakehouse architecture strives to combine the resilience of a Data Warehouse with the flexibility of a Data Lake. Data Warehouses provided highly performant query-ability, reliable data quality, and ease of use. The Data Lake was created to ensure big data processing could keep up with growing data demands, that complicated data pipelines could be modularized to some extent, and that data scientists and data engineers could both take advantage of it. The Data Lakehouse gets the best of both worlds.

Before we dive into the high-level architecture of the data lakehouse, some of the key aspects it brings to the table are:

  • Centralizing the data storage while allowing support for both structured and unstructured data.
  • Modularity which helps with ease of data management and scalability of specific components
  • Supporting ML/AI use cases along with BI reporting use cases
  • Makes data swamps a thing of the past by adding a layer of data governance.
  • Robust security architecture that is implemented around the data storage tier

High Level Architecture

The figure below illustrates the conceptual design of a data lakehouse solution. The Data Lake (storage) is central to all the services, with the rest of the services working around the storage layer in a modular fashion. The data lake creates a clear separation through Bronze, Silver, and Gold tiers: Bronze is the raw layer where data is ingested from your various data sources, Silver is the normalized and augmented/enriched data processing layer, and Gold is the aggregated layer from which your data is served to end users, whether for BI, data analytics, APIs, etc.
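To make the tiers concrete, here is a minimal sketch in plain Python (the record fields and transformation rules are hypothetical; a real lakehouse would implement each tier as tables in a framework such as Spark with Delta Lake) of how data might flow from Bronze through Silver to Gold:

```python
# Bronze: raw records exactly as ingested, warts and all (hypothetical source).
bronze = [
    {"store": "A", "sale": "12.50"},
    {"store": "a", "sale": "7.25"},
    {"store": "B", "sale": None},    # bad record from the source
    {"store": "B", "sale": "3.00"},
]

# Silver: normalized and cleansed (cast types, standardize keys, drop invalid rows).
silver = [
    {"store": rec["store"].upper(), "sale": float(rec["sale"])}
    for rec in bronze
    if rec["sale"] is not None
]

# Gold: aggregated view, ready to serve to BI dashboards or an API.
gold = {}
for rec in silver:
    gold[rec["store"]] = gold.get(rec["store"], 0.0) + rec["sale"]

print(gold)  # per-store sales totals
```

The point of the tiering is that each layer is reproducible from the one below it, so a bad transformation can be rerun against Bronze without re-ingesting from the source.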

Key features to note about this architecture:

  • Centralized storage layer — allows concurrent reading and writing of the data. Reduces data redundancy and data movement by allowing different services to connect to the storage layer directly. The storage layer supports both structured and semi-structured data types, including IoT data.
  • Flexibility & Extensibility — you no longer need to create an ETL process to move aggregated data into a data warehouse. By bringing multiple services to the data rather than creating ETL processes to push the data to them, you enable extensibility and modularity in the architecture. That modularity lets you swap technologies in and out, which matters given how quickly the big data tech stack evolves. It also makes the architecture amenable to streaming.
  • De-couple storage and compute — Because of the flexibility of this architecture, it allows you to decouple storage and compute, making it easier to keep your data organized and catering to multiple different use cases.
  • Data governance — data governance becomes really easy and is a highly critical aspect of this architecture, ensuring that your data storage layer does not turn into a data swamp. A centralized, organized storage tier lets you use a data governance management tool that monitors the entire tier, from raw data all the way through the aggregated data. Organizing data with standardized storage formats and schema enforcement helps ensure your data lineage is tracked from inception to delivery.

Microsoft Azure Implementation of the Lakehouse Architecture

Now that you have a good understanding of what the data lakehouse architecture is, let’s dive a bit deeper into what an implementation of the data lakehouse might look like using Microsoft Azure.

One of the key technologies in this lakehouse stack is Delta Lake. Databricks added the capability of a structured transactional layer with the launch of Delta Lake in 2018. We will dive more into this in a separate post.
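We will cover Delta Lake properly in that post, but to give a flavor of what a “structured transactional layer” means, here is a conceptual sketch in plain Python — a toy loosely inspired by Delta Lake’s transaction log idea, not its actual format — where a version becomes visible to readers only once its log entry is atomically committed:

```python
import json
import os
import tempfile

class TinyTableLog:
    """Toy append-only transaction log; loosely inspired by Delta Lake's
    _delta_log concept, not its real protocol."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def commit(self, rows):
        """Atomically append a new version containing `rows`."""
        version = len([n for n in os.listdir(self.path) if n.endswith(".json")])
        entry = os.path.join(self.path, f"{version:020d}.json")
        # Write-then-rename so a crash mid-write never exposes a partial commit.
        tmp = entry + ".tmp"
        with open(tmp, "w") as f:
            json.dump(rows, f)
        os.rename(tmp, entry)
        return version

    def snapshot(self):
        """Read a consistent view: the union of all committed versions."""
        rows = []
        for name in sorted(os.listdir(self.path)):
            if name.endswith(".json"):
                with open(os.path.join(self.path, name)) as f:
                    rows.extend(json.load(f))
        return rows

log = TinyTableLog(tempfile.mkdtemp())
log.commit([{"id": 1}])
log.commit([{"id": 2}])
print(len(log.snapshot()))  # both committed rows are visible
```

The write-then-rename trick is what gives even this toy atomic commits on a single filesystem; Delta Lake layers the same log-based idea (plus schema enforcement and time travel) on top of cloud object storage.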

If you would like us to evaluate and review your current progress with your Data Architecture, please email us at info@datasurge.com or complete the form on our contact us page.
