Design Patterns for Data Lakes

Lackshu Balasubramaniam
7 min read · Apr 5, 2020


Introduction

Data Lakes are at the heart of big data architecture; as a result, careful planning is required when designing and implementing one. I’ll walk through my findings and thoughts on the topic in this post.

On a side note, I do love the metaphor of water as data; it reminds me of the famous Bruce Lee quote.

Empty your mind, be formless, shapeless — like water. Now you put water in a cup, it becomes the cup; You put water into a bottle it becomes the bottle; You put it in a teapot it becomes the teapot. Now water can flow or it can crash. Be water, my friend.

Moving on, to help illustrate the concepts I’ll use a modern data warehouse architecture as per the diagram below.

Modern Data Warehouse Architecture

When I look at the Lambda Architecture or the Kappa Architecture, I feel a Data Lake does not supersede a Data Warehouse; they each serve their respective functions. The technologies/products highlighted above are representative and could be replaced by equivalent products from other vendors.

Note: Azure SQL Data Warehouse is now known as Azure Synapse Analytics.

In the diagram above there are different layers that fulfil each stage of preparing data for consumption:

  • Ingestion layer that ingests data from various sources in stream or batch mode into the Raw Zone of the data lake.
  • Transformation layer which allows for extract, load and transform (ELT) of data from the Raw Zone into the target Zones and the Data Warehouse.
  • The Data Integration capability at the top enables the movement of data.
  • The Data Lake is the storage capability that allows for high-speed processing and storage of large volumes of data.
  • The Semantic Layer can be implemented in the Data Lake via Delta Lake or suitable file formats; however, a Data Warehouse has far more mature capabilities for modelling and serving data to users.
  • Analytics and Reporting Layer provides the data consumption capability and the opportunity to build a sandbox environment for experimenting with the data in tandem with the Analytics Zone in the Data Lake.

As can be seen in the diagram above, the Data Lake plays a central role in storing and processing large volumes of data.

Benefits and Challenges

The book Architecting Data Lakes expounds the benefits of Data Lakes as outlined below:

  • Schema-on-read (which speeds up data movement and defers applying schema to data).
  • Can store raw as well as refined data
  • Low cost storage and processing
  • Scales at low cost
  • Multiple avenues for data access including SQL-like syntax
  • Ability to handle both Batch and Stream processing
  • Amenable to complex processing

However, the challenges in implementing Data Lakes are:

  • Visibility of data, as there’s a risk the Data Lake becomes a dumping ground and users get overwhelmed by the deluge of data.
  • There’s also a real risk of the Data Lake becoming a Data Swamp due to a lack of governance and oversight. A balance between top-down and bottom-up approaches to landing data in the Data Lake is necessary.
  • Data Governance will need to be factored in early in the design.

Data Lake Zones

As far as data lake zones are concerned, I would avoid over-engineering them. Start with the basic zones and add more as required by the business. Allow the zones and the data engineering infrastructure around them to grow organically. The zones I found useful are listed below.

Raw

  • Locked down security-wise. Typically, only service principals/accounts and administrators would have access.
  • Stores data from source systems as-is.
  • For database sources, data might be stored in Avro, Parquet or ORC format. This allows for partitioning, bucketing and sizing strategies, which optimize downstream data processing.
  • Depending on the regulations in place, the ingestion process could mask or tokenize sensitive data.
  • Retained indefinitely and immutable.
  • Folders are time-based for easier roll-forward to downstream zones (see the sketch after this list).
  • I would keep the initial (first-time) ingestion folder in the same structure rather than in a separate one. This keeps the ELT process simple.
  • Can be used to regenerate downstream zones.
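
As a minimal sketch of what landing a batch extract into the Raw Zone could look like, the PySpark snippet below writes a database extract to a time-based folder in Parquet. The mount point, paths and connection details are hypothetical placeholders, not part of any specific implementation.

```python
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Raw Zone root, e.g. an ADLS Gen2 container mounted on the cluster
raw_root = "/mnt/datalake/raw"

# A time-based folder per ingestion run keeps Raw immutable and easy to roll forward
run_date = datetime.now(timezone.utc)
target_path = f"{raw_root}/sales/orders/{run_date:%Y/%m/%d}"

# Pull the source table over JDBC (connection details are placeholders)
orders_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<source-server>;databaseName=<source-db>")
    .option("dbtable", "dbo.Orders")
    .load()
)

# Land the extract as-is in Parquet so downstream zones can be regenerated from Raw
orders_df.write.mode("overwrite").parquet(target_path)
```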

Structured

  • Data for analysis; it would therefore have a standard, intuitive structure across the board.
  • Used by a smaller base of Data Analysts and Super Users: individuals who are comfortable querying data.
  • Data is managed at entity level (not in terms of periodic ingestion) and could be stored in Delta or Parquet format. I tend to prefer Parquet to keep it simple; Delta adds a transaction log on top of the data structures, and that capability might not be required at this point.
  • Additional columns are added during the transformation for auditing and lineage purposes (see the sketch after this list).
  • Entities are queryable via Apache Hive for easier consumption.
  • Only entities that are required are loaded into the zone.
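
A minimal sketch of that transformation step, assuming a Hive metastore and hypothetical mount paths: audit/lineage columns are appended, the entity is written as Parquet, and a Hive table is registered over it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

raw_path = "/mnt/datalake/raw/sales/orders/2020/04/05"     # hypothetical
structured_path = "/mnt/datalake/structured/sales/orders"  # hypothetical

orders_raw = spark.read.parquet(raw_path)

# Append audit/lineage columns during the transformation
orders = (
    orders_raw
    .withColumn("load_timestamp", F.current_timestamp())
    .withColumn("source_system", F.lit("sales_db"))
    .withColumn("source_file", F.input_file_name())
)

# Store the entity as Parquet and expose it as a Hive table for querying
orders.write.mode("overwrite").parquet(structured_path)
spark.sql("CREATE DATABASE IF NOT EXISTS structured")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS structured.orders
    USING PARQUET
    LOCATION '{structured_path}'
""")
```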

Trusted

  • Data for general consumption. Entities are queryable via Apache Hive for easier consumption. The reporting layer could connect directly to the Trusted Zone.
  • Only entities that are curated are loaded into the zone. Curating data involves significant data engineering effort.
  • Enriched with information. Could have entities combined into higher-level entities.
  • Organized for optimal delivery. Delta is the ideal format as it caters for ACID transactions, although the technology still needs to mature for easier consumption by reporting layers (see the sketch after this list).
  • Can be used to store data at entity level as well as at aggregated, summarized levels, i.e. the bronze, silver, gold tables approach. As a result, the zone could be used interchangeably with a data warehouse in some cases.
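
As an illustration of why Delta’s transaction log is useful here, the sketch below upserts a curated entity into a Delta table in the Trusted Zone. The paths and the order_id join key are assumptions made for the example.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

trusted_path = "/mnt/datalake/trusted/sales/orders"  # hypothetical

# 'curated_df' stands in for the cleansed, enriched entity produced by the curation process
curated_df = spark.read.parquet("/mnt/datalake/structured/sales/orders")

if DeltaTable.isDeltaTable(spark, trusted_path):
    # Upsert into the existing Delta table; ACID guarantees come from Delta's transaction log
    (
        DeltaTable.forPath(spark, trusted_path).alias("t")
        .merge(curated_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
else:
    # First load creates the Delta table
    curated_df.write.format("delta").save(trusted_path)
```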

Analytics

  • Exploratory analytics for data science
  • Prototyping
  • Could be pushed back into Trusted upon generating useful insights.
  • This zone is for sandboxes and doesn’t have to be part of mainstream data engineering.

Transient

  • To hold data in transit, especially when moving data between zones or between a zone and the data warehouse.
  • The storage will need to be cleaned up periodically, or as part of the data engineering process after data movement.

Archive

  • Aged data
  • Queryable via technologies like PolyBase in Azure Synapse Analytics.

Data Processing

Ideally, data processing should be metadata-driven for easier management of extract, load and transform (ELT) processes. As per the earlier diagram, there is a clear separation of data processes based on the zone where the data lands, as described below:

Data Ingestion

  • This layer ingests data from various sources into the Raw Zone
  • A batch ingestion mechanism like Azure Data Factory (ADF) would be used to ingest batch data sources like databases or file extracts.
  • A stream ingestion mechanism like Spark Streaming/Event Hubs/IoT Hub would be used to ingest (near) real-time sources like IoT data, click-stream/log events or database change data capture (CDC). A sketch of this follows the list.
  • Ingestion processes mask or tokenize sensitive data as prescribed by regulation.
  • The data formats will not change, apart from database sources, which will land as Avro, Parquet or ORC files.
  • Logging to indicate the data source, row counts, etc.
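
As a hedged sketch of the streaming path, the snippet below reads click-stream events from the Kafka-compatible endpoint of Azure Event Hubs with Spark Structured Streaming and lands them in the Raw Zone. The namespace, topic, paths and the omitted authentication settings are all placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "/mnt/datalake/raw/clickstream"                             # hypothetical
checkpoint_path = "/mnt/datalake/transient/_checkpoints/clickstream"   # hypothetical

# Read from the Kafka-compatible endpoint of Event Hubs
# (SASL credentials omitted for brevity; they are required in practice)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "clickstream")
    .option("kafka.security.protocol", "SASL_SSL")
    .load()
)

# Land the raw payload into the Raw Zone as Parquet, micro-batch by micro-batch
query = (
    events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream.format("parquet")
    .option("path", raw_path)
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="1 minute")
    .start()
)
```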

Data Transformation

This layer standardizes and transforms data for downstream consumption.

  • Rule-based and driven by consumption models.
  • Identification of security constraints based on user groups.
  • Compliance check for regulatory requirements. Classification of data as public, internal, sensitive, restricted, etc.
  • Entity mapping, column mapping and augmented columns as required (see the sketch after this list).
  • Data quality checks, data cleansing and data enrichment as part of the curation process when moving to the Trusted Zone.
  • Data movement from the Data Lake into the Data Warehouse should be a seamless process. For Azure Synapse Analytics, Databricks and PolyBase allow for easy movement into the data warehouse staging area.
  • ELT processes in the Data Warehouse can be driven by Data Vault, Dimensional (Kimball) or both approaches.
  • Some data profiling would be useful. Levels of data profiling will depend on time windows for data engineering processes.
  • Logging to indicate business unit, business process, data source, targets, row counts pre- and post-processing, etc. This data will be useful for data governance.
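
To illustrate the metadata-driven idea, here is a minimal sketch in which a hypothetical metadata entry drives the column mapping, augmented columns and row-count logging for one entity; a real implementation would read this metadata from a configuration store rather than an inline dictionary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical metadata entry; in practice this would come from a metadata store
entity_meta = {
    "source_path": "/mnt/datalake/raw/sales/customers/2020/04/05",
    "target_path": "/mnt/datalake/structured/sales/customers",
    "column_mapping": {"CUST_ID": "customer_id", "CUST_NM": "customer_name"},
    "augmented_columns": {"record_source": "sales_db"},
}

df = spark.read.parquet(entity_meta["source_path"])
rows_in = df.count()

# Apply the column mapping defined in metadata
for old_name, new_name in entity_meta["column_mapping"].items():
    df = df.withColumnRenamed(old_name, new_name)

# Add augmented columns defined in metadata
for col_name, value in entity_meta["augmented_columns"].items():
    df = df.withColumn(col_name, F.lit(value))

df.write.mode("overwrite").parquet(entity_meta["target_path"])

# Minimal pre/post row-count logging for governance purposes
print(f"customers: rows_in={rows_in}, rows_out={df.count()}")
```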

Data Consumption

Analytics and reporting can be achieved through various tools; the common ones are Tableau and Power BI. With the PolyBase capabilities in Azure Synapse Analytics, data can be accessed where it lives, i.e. data in the data lake can be joined as external tables to tables that reside in the data warehouse.

Data Lake Management

I feel data management for the Raw Zone should fall to IT, through metadata and data engineering processes. From the Structured Zone upwards, business functions should manage data through data governance processes (working together with data engineering processes).

Some of the considerations around data management are:

  • Data quality and consistency for business to use in decision making
  • Policies, standards and regulations around ingesting, transforming, consuming data
  • Security, privacy, and compliance which ties to how data is laid out as well as authentication and authorization of users
  • Data life cycle management, which includes archiving data as it ages.

Data Governance

Existing business processes will need to leverage a Data Catalogue, Business Glossary and Data Lineage to consume the data effectively. These provide the ability for analysts and key business users to interrogate the data.

  • Technical, operational, and business metadata is required as building blocks for Data Governance.
  • Data Catalogue-wise, Apache Hive paired with a data governance tool that can read from Hive will be useful for building a catalogue of assets. The assumption here is that there’s a Spark/Databricks cluster which mounts the data lake zones for consumption/analytics (see the sketch after this list).
  • Some tools can read the Data Lake file system, but this would be of limited use.
  • Data Stewards and the business users who work on the respective areas should make a concerted effort to enrich the data catalogue and correlate the entries to a business glossary.
  • Data Lineage can be built from data engineering processes and corresponding metadata. A lineage graph would be very useful for impact analysis and troubleshooting. Furthermore, data stewards and business users could add some context to data lineage information.
  • Data Usage is a key consideration, as the popularity of data drives better management by helping prioritize data governance efforts.
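
As a small sketch of the Hive-backed catalogue idea, assuming a cluster with Hive support enabled, the snippet below enumerates databases, tables and columns from the metastore as raw input for a catalogue of assets.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Walk the Hive metastore and emit one line per table: database, name, type, columns.
# A data governance tool would ingest and enrich this technical metadata.
for db in spark.catalog.listDatabases():
    for table in spark.catalog.listTables(db.name):
        columns = [c.name for c in spark.catalog.listColumns(table.name, db.name)]
        print(db.name, table.name, table.tableType, columns)
```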

References

Architecting Data Lakes by Ben Sharma (Zaloni)

Data Lake Use Cases and Planning

Data Lakes in a Modern Data Architecture

Databricks Delta: A Unified Management System for Real-time Big Data

Productionizing Machine Learning With Delta Lake

Azure Databricks Architecture on Data Lake

Data Lake Storage Introduction


Lackshu Balasubramaniam

I’m a data engineering bloke who’s into books. I primarily work on Azure and Databricks. My reading interest is mostly around psychology and economics.