Data Lake Explained

Stefan Deusch
Enquizit
Oct 20, 2021

This post explains what a data lake is and helps you decide whether you should use one.

What is a data lake?

The simplest definition of a data lake is ‘a centralized repository to store unstructured and structured data at any scale’. Data stored as-is can be used for data mining or machine learning, while structured data can serve a range of analytics applications, from dashboards and real-time analytics to big data processing.

The cloud in particular is a catalyst for adopting a data lake because cloud offerings for storage and compute have become competitively cheap and highly reliable. In 1980, 1 GB of storage cost over 100,000 USD; today it costs 2.3 cents per month in the Amazon Web Services (AWS) cloud.

Cloud providers enable the provisioning of a data lake in minutes, and serverless cloud services charge pay-as-you-go fees.

The case for data lakes

The amount of data has been growing exponentially over the last few years. The reason is our online lifestyle: shopping, banking, gaming, working, social media, and more on our mobile, in-car, and desktop devices. Data is easier to capture and already in digital format, e.g., from doorbell cameras in homes. Advances in analytics, artificial intelligence (AI), and machine learning (ML) have made it easier for organizations to extract valuable insights from big data.

How does a data lake compare to a traditional data warehouse?

Previously, data warehouses were the only type of central repository holding all business data, current and historical. Data warehouse projects need thorough planning, from ingestion and transformation to data modeling. The data is usually highly transformed and structured. Data warehouse technology is expensive; hence data is only ingested if there is a justifiable benefit.

Data lakes, in contrast, are cheap and easy to scale. Data can be stored in a raw or slightly processed format, without having to determine its final use. All data can be stored as-is; no time or effort needs to be spent on optimizing data structures or translating them into a compliant format.

The transformation of the data happens when we are ready to use it, either directly as raw data or after an additional processing step into open-standard formats. The data lake approach is known as “Schema on Read”, as opposed to the “Schema on Write” approach used in the data warehouse.

The “Schema on Write” approach makes it much harder to achieve 100% collection efficiency because of its strict formatting requirements, whereas “Schema on Read” accepts all data in any format. If data completeness is a concern, the data lake is the more efficient solution. Data collected from external sources is subject to change, sometimes in unpredictable ways. Data lakes allow the data to be collected first, while “Schema on Write” forces it to be parsed before it is collected.

The “Schema on Read” approach is also more flexible. It allows you to adapt to changes in the data and to implement different views of the same data. A data scientist can look at demographic data, while an analyst can report on aggregate business metrics.
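To make the contrast concrete, here is a minimal sketch of “Schema on Read” with boto3. The bucket name, key layout, and event fields are hypothetical; the point is that the raw payload is landed exactly as it arrived, and each reader applies whatever schema it needs later.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

# "Schema on Write" would force us to parse and validate this event before storing it.
# With "Schema on Read" we land the raw payload as-is in the Raw zone.
raw_event = '{"user_id": "42", "action": "checkout", "cart": {"items": 3}, "ts": "2021-10-20T12:00:00Z"}'
s3.put_object(
    Bucket=BUCKET,
    Key="raw/clickstream/2021/10/20/event-0001.json",
    Body=raw_event.encode("utf-8"),
)

# Later, a consumer applies its own schema at read time and projects only the
# fields it cares about.
obj = s3.get_object(Bucket=BUCKET, Key="raw/clickstream/2021/10/20/event-0001.json")
event = json.loads(obj["Body"].read())
print(event["user_id"], event["action"])
```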

Components of a data lake

The conceptual components of a data lake are shown in Figure 1. At the bottom is the physical storage layer (Amazon Simple Storage Service (S3) in AWS). Data is stored in files whose location reflects the tables they represent. Tables are defined “Schema on Read” by referencing those files.

Figure 1: The components of a data lake
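As a sketch of how such a table definition might look in AWS, the snippet below registers an external table in the AWS Glue Data Catalog that simply points at JSON files under an S3 prefix. The database, table, column, and bucket names are placeholders; the SerDe and formats shown are the standard Hive classes Glue accepts for newline-delimited JSON.

```python
import boto3

glue = boto3.client("glue")

# Register a "Schema on Read" table over JSON files already sitting in S3.
# Database, table, column, and bucket names are hypothetical.
glue.create_table(
    DatabaseName="datalake_raw",
    TableInput={
        "Name": "clickstream",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "user_id", "Type": "string"},
                {"Name": "action", "Type": "string"},
                {"Name": "ts", "Type": "string"},
            ],
            "Location": "s3://example-data-lake/raw/clickstream/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    },
)
```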

Figure 2 shows a different way of grouping the data lake areas, by their level of data processing. The ‘Raw’ area is the landing zone for all raw data. The ‘Processed’ area is home to more refined data: transformations to standard data types (string, integer, date, time, etc.) and compression into optimized columnar formats save money and speed up queries. The ‘Published’ area contains further refined data, intended for publication to stakeholders. In the Published area, data is filtered down to relevant fields, rolled up into aggregates, enriched from other sources, and formatted for its intended consumers.

Figure 2: Data lake with progressive areas of data processing
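One way the Raw-to-Processed step could look is sketched below: read a raw CSV file from the Raw zone, cast columns to standard types, and write compressed, columnar Parquet into the Processed zone. The bucket, prefixes, and column names are placeholders, and the sketch assumes pandas with s3fs and pyarrow installed.

```python
import pandas as pd

# Raw zone: CSV landed as-is. Processed zone: typed, columnar, compressed.
raw_path = "s3://example-data-lake/raw/orders/2021/10/20/orders.csv"
processed_path = "s3://example-data-lake/processed/orders/dt=2021-10-20/orders.parquet"

df = pd.read_csv(raw_path)

# Cast to standard data types (string, integer, date/time, ...).
df["order_id"] = df["order_id"].astype("int64")
df["order_date"] = pd.to_datetime(df["order_date"])
df["amount"] = df["amount"].astype("float64")

# Columnar, compressed output cuts storage cost and speeds up queries.
df.to_parquet(processed_path, compression="snappy", index=False)
```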

Where to get started

When embarking on building a data lake, the organization’s use cases should be at the center of the decision. Is there a desire to build a central repository with all data? Are there so many siloed databases that integration has become a nightmare? Is there business value in looking at all sources of data combined and at an aggregate level? Usually, there is.

Once a data lake is being planned, a cloud provider must be selected; AWS, Google, and Azure are the three big ones. The decision should factor in data lake features, integrations with other services, compliance, security, price, and company policies.

Choosing the right provider and designing for the current and future organizational requirements may best be done in collaboration with an experienced technology partner. Enquizit Inc. has delivered numerous cloud data projects and data lakes.

Even with the most sophisticated data lake technologies, challenges still abound around management and governance, such as:

· how to avoid duplicate data

· how to determine the proper zones of the lake

· how to share data safely, within the organization and with outside parties

· how to keep performance high

· how to protect confidential data

· how to push changes to production environments

· how to deploy a data lake

· how to test it

Data lakes in AWS

AWS has the services to quickly build a data lake in a secure, flexible, and cost-effective way. The following services are foundational and required for the minimal feature set of a data lake:

· Amazon Simple Storage Service (S3)

· AWS Glue

· Identity and Access Management (IAM)

· Amazon Athena

S3 provides the storage layer. AWS Glue has features to inspect raw data and come up with a schema definition. It also assists in writing extract-transform-load (ETL) jobs that convert raw data to optimized, transformed data. However, its most important feature is its metadata repository with the catalog of databases and table definitions (the queryable data lake). IAM is the ubiquitous service in AWS that manages user authentication and data access permissions. Amazon Athena is a serverless environment to query data in S3 using standard SQL.
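As an illustration of how these pieces fit together, the sketch below submits a standard SQL query to Athena against a Glue-cataloged table and prints the rows. The database, table, and results bucket are hypothetical; the calls are the standard boto3 Athena operations.

```python
import time
import boto3

athena = boto3.client("athena")

# Run standard SQL against a table defined in the Glue Data Catalog.
# Database, table, and results bucket are placeholders.
query = athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) AS events FROM clickstream GROUP BY action",
    QueryExecutionContext={"Database": "datalake_raw"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows (first row is the header).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```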

The following services add more power to a data lake, but they are optional.

· AWS LakeFormation

· Amazon Redshift

· Amazon QuickSight

· Amazon Elastic Map Reduce (EMR)

· Amazon SageMaker

LakeFormation is a new service to define data lakes through a single interface. It orchestrates all the essential services above, plus it offers an additional security and control layer for data access. Redshift is the AWS solution for traditional data warehousing. It has a Redshift Spectrum feature that allows query syndication across databases (i.e., you can join data in a Redshift data warehouse with data in S3 files in the lake). QuickSight is for business intelligence (BI) reporting, similar to Tableau or Looker. It can run SQL queries and thus visualize the data of a lake using Athena or Redshift Spectrum. EMR is the AWS service to spin up clusters for distributed parallel computing quickly. The term MapReduce was initially associated with the Hadoop framework, but EMR comes pre-installed with similar frameworks such as Spark, HBase, Hive, Flink, and Presto. SageMaker is the kitchen sink service for ML. Both EMR and SageMaker have deep integration with the data lake.
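To show what a warehouse-plus-lake query could look like, here is a sketch that submits a Spectrum-style join through the Redshift Data API. It assumes an external schema over the lake has already been created in Redshift; the cluster, database, user, and schema names are all placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Join warehouse data (a local schema) with lake data exposed through an
# external Spectrum schema. All identifiers below are hypothetical.
sql = """
    SELECT c.customer_segment, SUM(o.amount) AS revenue
    FROM warehouse.customers AS c
    JOIN spectrum_lake.orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_segment
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql=sql,
)
print("Submitted statement:", response["Id"])
```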

In addition to these AWS services, there are a few third-party vendors who offer additional tools to manage data lakes for more specialized use cases:

· Dremio

· Qubole

· Databricks

· Snowflake

Summary

Data lakes meet the demand created by the rising tide of information from new technologies and business models. They have become possible through highly available, cost-effective storage in the cloud ($0.023 per GB per month) and advances in analytics frameworks.

Data lakes are cheaper to maintain than data warehouses for a few critical reasons. First, no prior data modeling is necessary, which allows you to defer decisions until later: you can capture data before deciding how to use it. Second, storage costs are significantly lower in a data lake than in a data warehouse.

Data lakes can support analytics, BI, data science, ML, and compliance reporting, all of which can lead to higher-quality decisions, efficiency gains, and operational improvements.
