Design Patterns for Data Lakes

Lackshu Balasubramaniam
7 min read · Apr 5, 2020


Introduction

Data Lakes are at the heart of big data architecture; as a result, careful planning is required when designing and implementing one. I’ll walk through my findings and thoughts on the topic in this post.

On a side note, I do love the metaphor of water as data; it reminds me of the famous Bruce Lee quote.

Empty your mind, be formless, shapeless — like water. Now you put water in a cup, it becomes the cup; You put water into a bottle it becomes the bottle; You put it in a teapot it becomes the teapot. Now water can flow or it can crash. Be water, my friend.

Moving on, to help illustrate the concepts I’ll use a modern data warehouse architecture as per the diagram below.

Modern Data Warehouse Architecture

When I look at the Lambda Architecture or the Kappa Architecture, I feel a Data Lake does not supersede a Data Warehouse; they each serve their respective functions. The technologies/products highlighted above are representative and could be replaced by equivalent products from other vendors.

Note: Azure SQL Data Warehouse is now known as Azure Synapse Analytics.

In the diagram above there are different layers that fulfil each stage of preparing data for consumption:

  • Ingestion layer that ingests data from various sources in stream or batch mode into the Raw Zone of the data lake.
  • Transformation layer which allows for extract, load and transform (ELT) of data from the Raw Zone into the target Zones and the Data Warehouse.
  • The Data Integration capability at the top enables the movement of data.
  • The Data Lake is the storage capability that allows for high-speed processing and storage of large volumes of data.
  • The Semantic Layer can be implemented in the Data Lake via Delta Lake or suitable file formats; however, a Data Warehouse has far more mature capabilities for modelling and serving data to users.
  • Analytics and Reporting Layer provides the data consumption capability and the opportunity to build a sandbox environment for experimenting with the data in tandem with the Analytics Zone in the Data Lake.

As can be seen in the diagram above, the Data Lake plays a central role in storing and processing large volumes of data.

Benefits and Challenges

The book Architecting Data Lakes expounds the benefits of Data Lakes as outlined below:

  • Schema-on-read (which speeds up data movement and defers applying schema to data).
  • Can store raw as well as refined data
  • Low cost storage and processing
  • Scales at low cost
  • Multiple avenues for data access including SQL-like syntax
  • Ability to handle both Batch and Stream processing
  • Amenable to complex processing

However, the challenges in implementing Data Lakes are:

  • Visibility of data, as there’s a risk the Data Lake becomes a dumping ground and users get overwhelmed by the deluge of data.
  • There’s also a real risk of the Data Lake becoming a Data Swamp due to a lack of governance and oversight. A balance between top-down and bottom-up approaches to landing data in the Data Lake is necessary.
  • Data Governance will need to be factored in early in the design.

Data Lake Zones

As far as data lake zones are concerned, I would avoid over-engineering them. Start with the basic zones and add more as required by the business. Allow the zones and the data engineering infrastructure around them to grow organically. The zones I found useful are listed below.

Raw

  • Locked down security-wise. Typically, only service principals/accounts and administrators would have access.
  • Stores data from source systems as-is.
  • For database sources, data might be stored in Avro, Parquet or ORC format. This allows for partitioning, bucketing and sizing strategies, which optimize downstream data processing.
  • Depending on the regulations in place, the ingestion process could mask or tokenize sensitive data.
  • Retained indefinitely and immutable.
  • Folders are time-based for easier roll-forward to downstream zones (see the sketch after this list).
  • I would keep the initial (first-time) ingestion folder in the same structure rather than in a separate one. This keeps the ELT process simple.
  • Can be used to regenerate downstream zones.
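
As a minimal sketch of what landing a batch extract into the Raw Zone could look like, the PySpark snippet below writes a database extract to a time-based folder in Parquet. The mount point, paths and connection details are hypothetical placeholders, not part of any specific implementation.

```python
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Raw Zone root, e.g. an ADLS Gen2 container mounted on the cluster
raw_root = "/mnt/datalake/raw"

# A time-based folder per ingestion run keeps Raw immutable and easy to roll forward
run_date = datetime.now(timezone.utc)
target_path = f"{raw_root}/sales/orders/{run_date:%Y/%m/%d}"

# Pull the source table over JDBC (connection details are placeholders)
orders_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<source-server>;databaseName=<source-db>")
    .option("dbtable", "dbo.Orders")
    .load()
)

# Land the extract as-is in Parquet so downstream zones can be regenerated from Raw
orders_df.write.mode("overwrite").parquet(target_path)
```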

Structured

  • Data for analysis; it would therefore have a standard, intuitive structure across the board.
  • Used by a smaller base of Data Analysts and Super Users: individuals who are comfortable querying data.
  • Data is managed at entity level (not in terms of periodic ingestion) and could be stored in Delta or Parquet format. I tend to prefer Parquet to keep it simple; Delta adds a transaction log on top of the data structures, and that capability might not be required at this point.
  • Additional columns are added during the transformation for auditing and lineage purposes (see the sketch after this list).
  • Entities are queryable via Apache Hive for easier consumption.
  • Only entities that are required are loaded into the zone.
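
A minimal sketch of that transformation step, assuming a Hive metastore and hypothetical mount paths: audit/lineage columns are appended, the entity is written as Parquet, and a Hive table is registered over it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

raw_path = "/mnt/datalake/raw/sales/orders/2020/04/05"     # hypothetical
structured_path = "/mnt/datalake/structured/sales/orders"  # hypothetical

orders_raw = spark.read.parquet(raw_path)

# Append audit/lineage columns during the transformation
orders = (
    orders_raw
    .withColumn("load_timestamp", F.current_timestamp())
    .withColumn("source_system", F.lit("sales_db"))
    .withColumn("source_file", F.input_file_name())
)

# Store the entity as Parquet and expose it as a Hive table for querying
orders.write.mode("overwrite").parquet(structured_path)
spark.sql("CREATE DATABASE IF NOT EXISTS structured")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS structured.orders
    USING PARQUET
    LOCATION '{structured_path}'
""")
```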

Trusted

  • Data for general consumption. Entities are queryable via Apache Hive for easier consumption. The reporting layer could connect directly to the Trusted Zone.
  • Only entities that are curated are loaded into the zone. Curating data involves significant data engineering effort.
  • Enriched with information. Could have entities combined into higher-level entities.
  • Organized for optimal delivery. Delta is the ideal format as it caters for ACID transactions, although the technology still needs to mature for easier consumption by reporting layers (see the sketch after this list).
  • Can be used to store data at entity level as well as at aggregated, summarized levels, i.e. the bronze, silver, gold tables approach. As a result, the zone could be used interchangeably with a data warehouse in some cases.
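
As an illustration of why Delta’s transaction log is useful here, the sketch below upserts a curated entity into a Delta table in the Trusted Zone. The paths and the order_id join key are assumptions made for the example.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

trusted_path = "/mnt/datalake/trusted/sales/orders"  # hypothetical

# 'curated_df' stands in for the cleansed, enriched entity produced by the curation process
curated_df = spark.read.parquet("/mnt/datalake/structured/sales/orders")

if DeltaTable.isDeltaTable(spark, trusted_path):
    # Upsert into the existing Delta table; ACID guarantees come from Delta's transaction log
    (
        DeltaTable.forPath(spark, trusted_path).alias("t")
        .merge(curated_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
else:
    # First load creates the Delta table
    curated_df.write.format("delta").save(trusted_path)
```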

Analytics

  • Exploratory analytics for data science
  • Prototyping
  • Could be pushed back into Trusted upon generating useful insights.
  • This zone is for sandboxes and doesn’t have to be part of mainstream data engineering.

Transient

  • To hold data in transit, especially when moving data between zones or between a zone and the data warehouse.
  • The storage will need to be cleaned up periodically, or as part of the data engineering process after data movement.

Archive

  • Aged data
  • Queryable via technologies like PolyBase in Azure Synapse Analytics.

Data Processing

Ideally, data processing should be metadata-driven for easier management of extract, load and transform (ELT) processes. As per the earlier diagram, there is a clear separation of data processes based on the zone where the data lands, as described below:

Data Ingestion

  • This layer ingests data from various sources into the Raw Zone
  • A batch ingestion mechanism like Azure Data Factory (ADF) would be used to ingest batch data sources like databases or file extracts.
  • A stream ingestion mechanism like Spark Streaming/Event Hubs/IoT Hub would be used to ingest (near) real-time sources like IoT data, click-stream/log events or database change data capture (CDC). A sketch of this follows the list.
  • Ingestion processes mask or tokenize sensitive data as prescribed by regulation.
  • The data formats will not change, apart from database sources, which will land as Avro, Parquet or ORC files.
  • Logging to indicate the data source, row counts, etc.
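
As a hedged sketch of the streaming path, the snippet below reads click-stream events from the Kafka-compatible endpoint of Azure Event Hubs with Spark Structured Streaming and lands them in the Raw Zone. The namespace, topic, paths and the omitted authentication settings are all placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "/mnt/datalake/raw/clickstream"                             # hypothetical
checkpoint_path = "/mnt/datalake/transient/_checkpoints/clickstream"   # hypothetical

# Read from the Kafka-compatible endpoint of Event Hubs
# (SASL credentials omitted for brevity; they are required in practice)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "clickstream")
    .option("kafka.security.protocol", "SASL_SSL")
    .load()
)

# Land the raw payload into the Raw Zone as Parquet, micro-batch by micro-batch
query = (
    events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream.format("parquet")
    .option("path", raw_path)
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="1 minute")
    .start()
)
```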

Data Transformation

This layer standardizes and transforms data for downstream consumption.

  • Rule-based and driven by consumption models.
  • Identification of security constraints based on user groups.
  • Compliance check for regulatory requirements. Classification of data as public, internal, sensitive, restricted, etc.
  • Entity mapping, column mapping and augmented columns as required (see the sketch after this list).
  • Data quality checks, data cleansing and data enrichment as part of the curation process when moving to the Trusted Zone.
  • Data movement from the Data Lake into the Data Warehouse should be a seamless process. For Azure Synapse Analytics, Databricks and PolyBase allow for easy movement into the data warehouse staging area.
  • ELT processes in the Data Warehouse can be driven by Data Vault, Dimensional (Kimball) or both approaches.
  • Some data profiling would be useful. Levels of data profiling will depend on time windows for data engineering processes.
  • Logging to indicate business unit, business process, data source, targets, row counts pre- and post-processing, etc. This data will be useful for data governance.
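
To illustrate the metadata-driven idea, here is a minimal sketch in which a hypothetical metadata entry drives the column mapping, augmented columns and row-count logging for one entity; a real implementation would read this metadata from a configuration store rather than an inline dictionary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical metadata entry; in practice this would come from a metadata store
entity_meta = {
    "source_path": "/mnt/datalake/raw/sales/customers/2020/04/05",
    "target_path": "/mnt/datalake/structured/sales/customers",
    "column_mapping": {"CUST_ID": "customer_id", "CUST_NM": "customer_name"},
    "augmented_columns": {"record_source": "sales_db"},
}

df = spark.read.parquet(entity_meta["source_path"])
rows_in = df.count()

# Apply the column mapping defined in metadata
for old_name, new_name in entity_meta["column_mapping"].items():
    df = df.withColumnRenamed(old_name, new_name)

# Add augmented columns defined in metadata
for col_name, value in entity_meta["augmented_columns"].items():
    df = df.withColumn(col_name, F.lit(value))

df.write.mode("overwrite").parquet(entity_meta["target_path"])

# Minimal pre/post row-count logging for governance purposes
print(f"customers: rows_in={rows_in}, rows_out={df.count()}")
```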

Data Consumption

Analytics and reporting can be achieved through various tools; the common ones are Tableau and Power BI. With the PolyBase capabilities in Azure Synapse Analytics, data can be accessed where it lives, i.e. data in the data lake can be joined as external tables to tables that reside in the data warehouse.

Data Lake Management

I feel data management for the Raw Zone should fall to IT, through metadata and data engineering processes. From the Structured Zone upwards, business functions should manage data through data governance processes (working together with data engineering processes).

Some of the considerations around data management are:

  • Data quality and consistency for business to use in decision making
  • Policies, standards and regulations around ingesting, transforming, consuming data
  • Security, privacy, and compliance which ties to how data is laid out as well as authentication and authorization of users
  • Data life cycle management, which includes archiving data as it ages.

Data Governance

Existing business processes will need to leverage a Data Catalogue, Business Glossary and Data Lineage to consume the data effectively. These provide the ability for analysts and key business users to interrogate the data.

  • Technical, operational, and business metadata is required as building blocks for Data Governance.
  • Data Catalogue-wise, Apache Hive paired with a data governance tool that can read from Hive will be useful for building a catalogue of assets. The assumption here is that there’s a Spark/Databricks cluster which mounts the data lake zones for consumption/analytics (see the sketch after this list).
  • Some tools can read the Data Lake file system, but this would be of limited use.
  • Data Stewards and the business users who work on the respective areas should make a concerted effort to enrich the data catalogue and correlate the entries to a business glossary.
  • Data Lineage can be built from data engineering processes and corresponding metadata. A lineage graph would be very useful for impact analysis and troubleshooting. Furthermore, data stewards and business users could add some context to data lineage information.
  • Data Usage is a key consideration, as the popularity of data drives better management by helping prioritize data governance efforts.
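
As a small sketch of the Hive-backed catalogue idea, assuming a cluster with Hive support enabled, the snippet below enumerates databases, tables and columns from the metastore as raw input for a catalogue of assets.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Walk the Hive metastore and emit one line per table: database, name, type, columns.
# A data governance tool would ingest and enrich this technical metadata.
for db in spark.catalog.listDatabases():
    for table in spark.catalog.listTables(db.name):
        columns = [c.name for c in spark.catalog.listColumns(table.name, db.name)]
        print(db.name, table.name, table.tableType, columns)
```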

References

Architecting Data Lakes by Ben Sharma (Zaloni)

Data Lake Use Cases and Planning

Data Lakes in a Modern Data Architecture

Databricks Delta: A Unified Management System for Real-time Big Data

Productionizing Machine Learning With Delta Lake

Azure Databricks Architecture on Data Lake

Data Lake Storage Introduction


Lackshu Balasubramaniam

I’m a data engineering bloke who’s into books. I primarily work on Azure and Databricks. My reading interest is mostly around psychology and economics.