Data Lakes — how to enable Advanced Analytics and Machine Learning

Mirko Schedlbauer · Published in shipzero · Jun 13, 2019 · 5 min read

Each and every organization seems to need one nowadays. It’s as hip and trendy as it gets in the data management domain, but what is all the buzz around data lakes about? In this post, we’ll have a look at the most important use cases as well as some of the main characteristics of data lakes — and thereby figure out why we absolutely need one in some cases and may want to avoid it in others.

What is a Data Lake?

To get started, we need a common definition. I like the one by Gartner as it is on point and easy to grasp:

“A data lake is a collection of storage instances of various data assets additional to the originating data sources (…) stored in a near-exact (…) copy of the source format.”

So, we’re not speaking about pre-processed data marts here but rather a raw copy of your operational systems. But why do we need that when we already have the data in the source systems?

The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store. It should be the single, silo-less repository of all types of data and thereby a one-stop-shop for seamless analysis across all data within an organization.

There are basically two types of data lakes: Hadoop-based data lakes and relational data lakes, with Hadoop-based ones being far more common nowadays.

Comparison with the Data Warehouse

Most businesses already have some kind of a Data Warehouse (DWH) in place — whether it deserves the name in some of those cases is another topic. But isn’t a DWH also some kind of a Data Lake? Pradeep Menon described the difference between a Data Lake and a DWH nicely in his Demystifying Data Lake Architecture post:

“Data Lake stores data in the purest form, caters to multiple stakeholders and can also be used to package data in a form that can be consumed by end-users. On the other hand, Data Warehouse is already distilled and packaged for defined purposes.”

The most important point for most business users is this: Data Lakes store data that is usually not consumed directly by end-users. It should be pre-processed first, e.g. in the form of a DWH or data marts. The distinction lies in how the data is processed: in DWHs it is usually ETL (Extract, Transform, Load), whereas in Data Lakes it is ELT (Extract, Load, Transform).
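To make the ETL vs. ELT distinction concrete, here is a minimal, hypothetical sketch in Python using pandas and SQLAlchemy. The connection string, file names, table names, and column names (order_date, amount, customer_id) are illustrative assumptions, not references to any specific system.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@dwh-host/analytics")  # hypothetical DWH connection

# ETL (typical for a Data Warehouse): transform first, then load only the refined result.
orders = pd.read_csv("orders_export.csv", parse_dates=["order_date"])  # hypothetical source export
daily_revenue = (
    orders.groupby(orders["order_date"].dt.date)["amount"]
          .sum()
          .reset_index(name="revenue")
)
daily_revenue.to_sql("fact_daily_revenue", engine, if_exists="replace", index=False)

# ELT (typical for a Data Lake): load a near-exact raw copy first ...
orders.to_parquet("lake/raw/orders/orders_export.parquet")  # hypothetical lake path

# ... and transform later, only when an analysis actually needs it.
raw = pd.read_parquet("lake/raw/orders/orders_export.parquet")
avg_order_value = raw[raw["amount"] > 0].groupby("customer_id")["amount"].mean()
```

The point of the sketch is the ordering: in the ETL path only the aggregated result ever reaches the warehouse, while in the ELT path the raw copy lands in the lake unchanged and any aggregation happens on demand.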

There are hundreds of blog posts and articles on those differences — if you want to find out more on this issue, KDnuggets, Talend and Grazitti are some useful sources.

The only thing that really should stick here is: NO, they are not the same thing — but YES, both have a right to exist as they are optimized for different purposes.

Advantages / Disadvantages

So, let’s get back to our Data Lakes and why we need them. The critical point from a business point of view is data availability. As the overall speed of business increases, traditional architectures often fail to keep up with the changes. The result is decision support that is either missing or very inefficient. With Data Lakes in place, management decisions can always be guided by timely, quality analysis.

The main technical benefits are scalability (new data sources can be connected quite easily) and flexibility (structured, semi-structured, and unstructured data can all be stored and made available for analytics). With this schema flexibility, we are already at the point where the Data Lake concept enables Advanced Analytics and Machine Learning methods.
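As a minimal sketch of what this schema-on-read flexibility can look like in practice, here is a hypothetical PySpark example (Spark is common on Hadoop-based lakes). The paths and field names such as user_id and session_id are assumptions made purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Ingestion: store the semi-structured events exactly as they arrive,
# without forcing them into a predefined schema (no schema-on-write).
raw_events = spark.read.text("landing/clickstream/2019-06-13/*.json")  # hypothetical landing zone
raw_events.write.mode("append").text("lake/raw/clickstream/")          # hypothetical lake path

# Analysis: the schema is only applied at read time (schema-on-read),
# so new or changed fields in the source do not break ingestion.
events = spark.read.json("lake/raw/clickstream/")
sessions_per_user = events.groupBy("user_id").agg(F.countDistinct("session_id"))
sessions_per_user.show()
```

Because the schema is only imposed when the data is read, adding a new attribute in the source system does not require touching the ingestion pipeline at all.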

Benefits & Drawbacks of Data Lakes

Alright, scalability, flexibility and real-time analytics all sound amazing — why shouldn’t everyone have one? Because you simply don’t need one if your business environment is not volatile and you don’t have data sources that force you into big data or real-time analysis. When you need a reliable, retrospective reporting solution (e.g. with daily batch loads) with complex calculations and aggregations: stick with a Data Warehouse.

Breaking down the Hype

Is all this promotion of Data Lakes actually substantiated? I’d argue yes, because data is the most critical asset when it comes to making practical use of Artificial Intelligence. And in this context, it is not only about the amount of data but also about its quality, depth, and availability.

On the one hand, plainly wrong data obviously impacts analysis quality — but so does duplicated, outdated or falsely labelled data. On the other hand, missing data depth, e.g. in terms of metadata, is an issue, as descriptive data often gets lost when it is extracted through ETL processes. Once this information is lost, it can only be recovered from the source systems (if it is still available there).
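One way to guard against this loss, sketched below with purely hypothetical paths and field names, is to store the descriptive metadata right next to the untouched payload at ingestion time, rather than discarding it during transformation.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest_raw(payload: dict, source_system: str, lake_root: str = "lake/raw") -> Path:
    """Write the untouched source record together with its descriptive metadata."""
    ts = datetime.now(timezone.utc)
    record = {
        "payload": payload,              # near-exact copy of the source record
        "metadata": {                    # descriptive data that ETL pipelines often drop
            "source_system": source_system,
            "ingested_at": ts.isoformat(),
            "fields": sorted(payload.keys()),
        },
    }
    target = Path(lake_root) / source_system / f"{ts.strftime('%Y%m%dT%H%M%S%f')}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(record))
    return target

# Example: keep the ERP order exactly as received, plus where and when it came from.
ingest_raw({"order_id": 42, "amount": 99.5}, source_system="erp")
```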

Often it is even advisable to acquire external data to enrich your own proprietary data or to provide benchmarks for critical business areas. Take a look at our post on strategic data acquisition to find out more about the importance of, and approaches to, getting the right data.

Lastly, but maybe most importantly, real-time advanced analytics and machine learning concepts become feasible when we use a Data Lake in combination with streaming processes within a lambda architecture. In a nutshell, Lambda architectures combine two processing paths: a batch layer for predefined batch processing and a speed layer for real-time analytics.

The concept is described nicely in this brief introduction to data processing architectures.
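To illustrate the idea, here is a deliberately simplified, hypothetical sketch in plain Python: the same immutable events feed a batch layer (complete, but recomputed only periodically) and a speed layer (incremental and fresh), and a serving function merges both views when a query comes in.

```python
from collections import Counter

class BatchLayer:
    """Periodically recomputes a complete view from the master dataset in the lake."""
    def __init__(self):
        self.view = Counter()

    def recompute(self, master_dataset):
        self.view = Counter(event["page"] for event in master_dataset)

class SpeedLayer:
    """Incrementally updates a view from events that arrived since the last batch run."""
    def __init__(self):
        self.view = Counter()

    def process(self, event):
        self.view[event["page"]] += 1

    def reset(self):  # called after each batch recomputation
        self.view.clear()

def serve(batch: BatchLayer, speed: SpeedLayer, page: str) -> int:
    """Serving layer: merge the complete-but-stale batch view with the fresh real-time view."""
    return batch.view[page] + speed.view[page]

# Usage: one new event has arrived since the last nightly batch run.
master = [{"page": "/home"}, {"page": "/pricing"}]
batch, speed = BatchLayer(), SpeedLayer()
batch.recompute(master)
speed.process({"page": "/home"})
print(serve(batch, speed, "/home"))  # -> 2
```

Real implementations would of course use distributed batch and stream processors rather than in-memory counters, but the division of labour between the two layers is the same.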

Key Takeaways

  • Data Lakes are designed for fast ingestion of raw, detailed source data to enable on-the-fly processing for exploration, analytics, and operations
  • Data cataloguing and governance are critical for a successful, sustainable implementation
  • Data Warehouses and Data Lakes are complementary — both have a right to exist for their specific use cases
