Implementing Data Lakehouse in an Enterprise

Lackshu Balasubramaniam
7 min read · Mar 6, 2022

Data Lakehouse is considered the next leap in making data more accessible and more consumable in the enterprise. At a conceptual level, it can be summarized as:

  • applying a data warehouse model on top of the data lake
  • building business-level aggregations and metrics on top of that data model, to be consumed via BI tools.

Data Lakehouse also allows for streamlining data engineering processes and data governance. Building a Data Lakehouse is not a trivial undertaking and requires investment and effort in stages. Each project within the transformation program will need to:

  • make sense of the data available in the data lake for a business case
  • ingest/acquire data that’s not available
  • organize, integrate and enrich the data to make it accessible and user-friendly for the business.

I’ve been looking at an exciting set of technologies coming to the fore at Databricks. The Databricks data platform has reached a tipping point for implementing Data Lakehouses due to leaps in Delta Lake and adjacent technologies like Auto Loader, Delta Live Tables, SQL Endpoint and Unity Catalog.

A data lakehouse has the following key features:

  • Transaction support for concurrent reads and/or writes via the Delta engine
  • Schema enforcement on Delta tables to ensure data integrity (sketched below the list)
  • Governance and audit mechanisms; it’s possible to govern data seamlessly via Unity Catalog
  • BI support via SQL Endpoint, whereby BI tools can access data effortlessly. SQL Endpoint also allows for easy interrogation by data analysts, which leads to evolution of the data model as it’s understood better.
  • Storage decoupled from compute, which is the promise of Spark. This capability allows storage and compute to scale independently.
  • Open storage formats, in this case the Delta format. This also opens the possibility of integration with other platforms that support Delta via storage.
  • Support for all types of data: structured, semi-structured and unstructured. We need to be able to integrate all forms of data, including audio, video, images and documents, into the lakehouse.*
  • Support for streaming, which enables real-time reporting via integration of real-time data into the lakehouse.
  • Support for diverse workloads: data engineering, data science, machine learning, and SQL analytics

*However, there’s significant work to be done to make unstructured data usable, e.g. text processing for documents.
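
As a small illustration of the schema enforcement feature above, here is a hedged PySpark sketch; the table and column names are hypothetical. An append whose schema doesn’t match the Delta table is rejected unless schema evolution is explicitly enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table for illustration only
events = spark.createDataFrame([(1, "login"), (2, "logout")], ["user_id", "action"])
events.write.format("delta").mode("overwrite").saveAsTable("events_demo")

# An append with an unexpected column is rejected by Delta's schema enforcement
bad_batch = spark.createDataFrame([(3, "login", "web")], ["user_id", "action", "channel"])
try:
    bad_batch.write.format("delta").mode("append").saveAsTable("events_demo")
except Exception as err:
    print("Rejected by schema enforcement:", err)

# Schema evolution has to be opted into explicitly when the change is intentional
bad_batch.write.format("delta").mode("append") \
    .option("mergeSchema", "true").saveAsTable("events_demo")
```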

Conceptual Architecture

At a conceptual level, the diagram above depicts the layers that make up the lakehouse.

Data Ingestion/Data Acquisition

The source data could be databases, text files (CSV, JSON, XML), documents, images, audio, video, or sensor data (IoT). There are Spark connectors for various database sources and support for various file formats, which allow for ingestion into the data lake. There is also the option of leveraging Azure Data Factory-based ingestion into the Raw Zone. Subsequently, Databricks workflows process data from the Raw Zone into the Trusted Zone and upwards.
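
A minimal Auto Loader sketch of this ingestion step; the storage paths and table name are hypothetical, and `spark` is the notebook-provided session.

```python
# Incrementally pick up new raw JSON files and land them as a bronze Delta table.
raw_path = "abfss://lake@<storage-account>.dfs.core.windows.net/raw/sales/"
checkpoint_path = "abfss://lake@<storage-account>.dfs.core.windows.net/_checkpoints/sales_bronze/"

bronze_stream = (
    spark.readStream.format("cloudFiles")                   # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)   # schema tracking/evolution
    .load(raw_path)
)

(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(once=True)                                     # run as an incremental batch
    .toTable("trusted_sales_bronze"))
```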

Data Lake

The data lake would store source files in raw format, and processed data would be landed in Delta Lake format (Parquet files & transaction logs) following the medallion architecture, i.e. bronze, silver and gold tables. I prefer to separate the bronze tables into the Trusted Zone and the silver/gold tables into the Curated Zone in the data lake storage.
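
One way to make that zone convention explicit in code, assuming hypothetical paths, schemas and table names:

```python
# Zone layout assumed here:
#   raw/      -> source files as delivered
#   trusted/  -> bronze Delta tables
#   curated/  -> silver and gold Delta tables
LAKE_ROOT = "abfss://lake@<storage-account>.dfs.core.windows.net"

def zone_path(zone: str, table: str) -> str:
    """Return the Delta location for a table under the agreed zone layout."""
    return f"{LAKE_ROOT}/{zone}/{table}"

# Example: register a bronze table against the Trusted Zone
# (assumes the 'trusted' database already exists)
spark.sql(f"""
  CREATE TABLE IF NOT EXISTS trusted.sales_bronze (
    order_id BIGINT, amount DOUBLE, _ingest_ts TIMESTAMP)
  USING DELTA
  LOCATION '{zone_path('trusted', 'sales_bronze')}'
""")
```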

Metadata, Caching and Indexing Layer

The metadata, caching and indexing layer is driven by the Delta Lake client library. The Delta Lake engine provides reliability, performance and governance on top of the data lake. Unity Catalog leverages the metadata (including the Hive Metastore) to provide a view of the data catalog, data lineage and data governance for tables.
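
A few Delta commands that illustrate what this layer gives you in practice; the table name is hypothetical, and OPTIMIZE/ZORDER are Databricks Delta features:

```python
# Commit history as an audit trail
spark.sql("DESCRIBE HISTORY curated.sales_silver").show(truncate=False)

# Compaction plus Z-ordering for data skipping (indexing-like behaviour)
spark.sql("OPTIMIZE curated.sales_silver ZORDER BY (customer_id)")

# Time travel: read the table as of an earlier version
spark.sql("SELECT COUNT(*) FROM curated.sales_silver VERSION AS OF 12").show()
```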

Data Engineering

I feel the data engineering capability would predominantly be driven by Delta Live Tables (DLT) moving forward. DLT provides streamlined orchestration, i.e. transformation workflows, and allows for data quality checks. It is also possible to build frameworks on top of DLT for metadata-driven data transformations and data movement.

Data engineering could also be driven by conventional Databricks Delta Lake approaches, i.e. leveraging existing Delta Lake capabilities including Change Data Feed. However, DLT is more advantageous as it offers a streamlined and manageable approach to data transformation, i.e. workflows.

Example of a Delta Live Tables workflow
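
As a concrete illustration, here is a minimal DLT pipeline sketch in Python; the source path, table and column names are hypothetical. The pipeline engine resolves the dependency between the two tables and orchestrates them as a workflow.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw sales records landed from the Raw Zone")
def sales_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/sales/"))

@dlt.table(comment="Typed and cleaned sales records")
@dlt.expect_or_drop("positive_amount", "amount > 0")   # simple data quality rule
def sales_silver():
    return (dlt.read_stream("sales_bronze")
            .withColumn("amount", F.col("amount").cast("double"))
            .withColumn("order_date", F.to_date("order_date")))
```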

Data Transformation

The data transformation is driven by a curated data lake approach, as shown below, where the data flows through different stages into bronze, silver and gold tables.

BRONZE

Raw data from source systems or files is read using Spark and saved into Delta Lake bronze table(s). At this point schema checks could be applied if needed. The base columns could also be converted to the correct data types if the source files are semi-structured. The processing of unstructured sources would be more involved, depending on the data to be derived from them.

The bronze layer is what I used to call the Trusted Zone in my past articles. The data would still be source-system specific at this point. Some logging fields could be added to the table, for example ‘source file’ and/or ‘system’ for lineage, ‘update date’, ‘create date’, etc. History-tracking fields such as ‘from date’, ‘to date’ and ‘current flag’ could be added as well for SCD2 behavior.
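
A hedged sketch of that bronze step, using hypothetical source paths and table names, with the lineage/audit columns added alongside basic type conversion:

```python
from pyspark.sql import functions as F

raw = spark.read.option("header", "true").csv("/mnt/raw/erp/customers/")

bronze = (raw
    .withColumn("customer_id", F.col("customer_id").cast("bigint"))  # correct base types
    .withColumn("_source_file", F.input_file_name())                 # lineage
    .withColumn("_source_system", F.lit("erp"))
    .withColumn("_create_date", F.current_timestamp())
    .withColumn("_update_date", F.current_timestamp()))

# Assumes the 'trusted' database (Trusted Zone) already exists
bronze.write.format("delta").mode("append").saveAsTable("trusted.erp_customers_bronze")
```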

SILVER

The silver layer is what I used to call the Curated/Refined Zone in past articles. This is where things get interesting. At this point we need to define subject areas based on tables/views in the enterprise and our learnings thus far from previous attempts at implementing data warehouses. The subject areas could also be derived from an existing or vendor-provided enterprise data model.

Subject areas are broad business areas within the organization, like Finance, Operations, HR & Safety, Planning, IT, etc. It would be best to implement the silver layer subject area by subject area, with involvement from the business. The subject area would also determine the schema name for the set of tables within it.

This stage is where most of the effort lies. It gets into the nuts and bolts of Kimball modelling. We have room for some experimentation, so it might not be as formalized and could very well be a 3rd Normal Form data model.

The idea is to define a first-cut conceptual model of business concepts inter- and intra-subject area, for example how accounts receivable or inventory management works in the organization. It’s where master data like customers, suppliers, products, etc. and business transactions like sales, settlements, inventory movements, etc. are modelled.

One approach is to start small with projects that are in-flight:

  • Create a Bus Matrix for business processes/subject areas on a per-project basis. This allows for modelling at the right level of granularity.
  • Derive a star schema (conformed data model) based on the Bus Matrix.
  • The combination of business process and business entity forms the basis of table names.
  • A silver table could be derived from one or more tables in the bronze layer. Conversely, a bronze table could be decomposed into multiple silver tables.
  • It could also take a multistage transformation to get to the final silver table.
  • Define business rules to clean data and calculate derived columns.

At the end of the process we would have a series of workflows/pipelines to clean, enrich and conform our data. Some feature engineering would be required as well for data science.
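
A simplified sketch of one such silver-layer step, under hypothetical table, schema and column names: conforming two bronze customer tables into a single cleaned silver table inside a subject-area schema.

```python
from pyspark.sql import functions as F

erp = spark.table("trusted.erp_customers_bronze")
crm = spark.table("trusted.crm_customers_bronze")

customers_silver = (
    erp.join(crm, on="customer_id", how="left")
       .select(
           "customer_id",
           F.initcap(F.trim(erp["customer_name"])).alias("customer_name"),  # cleansing rule
           F.coalesce(crm["email"], F.lit("unknown")).alias("email"),        # business rule
           F.upper(erp["country_code"]).alias("country_code"),
       )
       .dropDuplicates(["customer_id"]))

# 'finance' is the subject-area schema in the Curated Zone (assumed to exist)
customers_silver.write.format("delta").mode("overwrite").saveAsTable("finance.customers")
```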

GOLD

The gold layer is where transformations are done for data analytics. It would be ideal to revisit the calculations, aggregations and metrics/KPIs built into existing reports/analytics models (if any) and consider moving some of them to gold tables for performance and reusability.

This could also be the layer where data marts are built for specific business use cases or business verticals. A dimensional model would be appropriate here as well. The visualization layer could then mirror this model for slicing and dicing data.

At this stage we would join a set of silver tables, narrow them down and aggregate them to fit our specific use case, saving the result as gold table(s). There could be issues finding common keys to join the tables, and a review of the transformation workflows might be required to realign the data model.

This could be a multi-hop process where multiple intermediate tables are created to form the final aggregated table. This step is heavy on application of business rules, calculated columns and aggregations.
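
A sketch of a gold-layer aggregation along those lines; the silver and gold table names are hypothetical:

```python
from pyspark.sql import functions as F

sales = spark.table("finance.sales")
customers = spark.table("finance.customers")

monthly_sales_by_country = (
    sales.join(customers, "customer_id")                      # conformed key from silver
         .withColumn("month", F.date_trunc("month", "sale_date"))
         .groupBy("month", "country_code")
         .agg(F.sum("amount").alias("total_sales"),
              F.countDistinct("customer_id").alias("active_customers")))

(monthly_sales_by_country.write.format("delta")
    .mode("overwrite")
    .saveAsTable("finance.monthly_sales_by_country_gold"))
```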

Data Quality

The DLT pipelines/workflows allow for applying rules around data quality and also for logging skipped rows into a quarantine table. This is key for dealing with data discrepancies and, ideally, feeding issues back to source systems.
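
A hedged sketch of that quarantine pattern in DLT, with hypothetical rules and table names: rows passing the rules flow through, while rows failing them are captured separately for investigation.

```python
import dlt
from pyspark.sql import functions as F

rules = {
    "valid_order_id": "order_id IS NOT NULL",
    "positive_amount": "amount > 0",
}
passes_all_rules = " AND ".join(f"({r})" for r in rules.values())

@dlt.table(comment="Rows passing all data quality rules")
@dlt.expect_all_or_drop(rules)
def orders_clean():
    return dlt.read_stream("orders_bronze")

@dlt.table(comment="Quarantined rows failing one or more rules")
def orders_quarantine():
    return dlt.read_stream("orders_bronze").filter(~F.expr(passes_all_rules))
```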

Data Catalog

There’s an opportunity to define table metadata in the pipeline code to be surfaced in Unity Catalog as descriptive attributes. This includes comments on tables, the type of table (i.e. bronze, silver or gold), table properties, etc. Lineage information is also derived from the code.
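
For example, descriptive metadata can be attached to a DLT table definition so that it surfaces in the catalog; the names and properties below are hypothetical.

```python
import dlt

@dlt.table(
    name="monthly_sales_by_country_gold",
    comment="Monthly sales aggregated by country for the finance dashboard",
    table_properties={"quality": "gold", "owner": "finance-data-team"},
)
def monthly_sales_by_country_gold():
    return dlt.read("sales_silver")   # upstream table defined elsewhere in the pipeline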

Data Governance

Unity Catalog is being touted as a solution for data governance and is currently in private preview. Via SCIM provisioning, we would integrate Azure AD accounts into the workspace. We could then leverage Unity Catalog as the basis for granting permissions on assets via GRANT statements. These permissions would apply at the SQL Endpoint and interactive cluster levels.
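
A sketch of the kind of grants involved, run as SQL against hypothetical objects and an Azure AD group; exact privilege names may differ as Unity Catalog evolves out of preview.

```python
# Allow the analyst group to read the gold table via the SQL Endpoint
spark.sql("GRANT SELECT ON TABLE finance.monthly_sales_by_country_gold TO `finance-analysts`")

# Allow the engineering group to write to the silver table
spark.sql("GRANT MODIFY ON TABLE finance.customers TO `data-engineers`")
```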

References

Building the Data Lakehouse — Bill Inmon, Mary Levins, Ranjeet Srivastava

Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics — Michael Armbrust, Ali Ghodsi, Reynold Xin, Matei Zaharia

