Datalake — An understanding and approach to extracting value

Vijaya Phanindra · Analytics Vidhya · Feb 15, 2020

A decade into Big Data and Hadoop, the questions are still the same: Can Hadoop replace the RDBMS? Can Big Data technologies replace data warehouse systems? Is the data lake the answer to breaking data silos and to breakthrough analytics?

Rather than providing answers (you know them anyway), here is an understanding of, and an approach to, building and extracting value from a data lake.

Going back to the data warehouse definition: a data warehouse is a decision support system that helps businesses make decisions based on aggregated analysis of transactional data from operational systems. Breaking this down:

  • Decision Support System — The system supports fact-based decision making for the business, based on past data and projected forecasts, if any. The facts are not approximations; as the name suggests, they are facts and certain.
  • Aggregate analysis — Most of the analysis consists of aggregations. You won’t ask “What did Joe buy at 4 PM on 14th Feb 2020 at the XYZ store?”; your questions will typically be “What is the trend of red product sales in the Valentine’s Day sale on 14th Feb over the last 5 years, and what is the forecast for next year?”
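
To make “aggregate analysis” concrete, here is a minimal sketch of such a trend query in Python using DuckDB over Parquet files; the sales dataset, its columns, and the path are all hypothetical.

```python
# A minimal sketch of an aggregate, decision-support style query.
# The sales dataset, its columns, and the file path are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory analytics database

trend = con.execute("""
    SELECT
        EXTRACT(year FROM sale_ts) AS sale_year,
        SUM(amount)                AS red_sales
    FROM read_parquet('data/sales/*.parquet')
    WHERE product_color = 'red'
      AND EXTRACT(month FROM sale_ts) = 2
      AND EXTRACT(day   FROM sale_ts) = 14
    GROUP BY sale_year
    ORDER BY sale_year
""").df()

print(trend)  # one row per year: the Valentine's Day sales trend
```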

The dominant architectural practices for designing data warehouses, Bill Inmon’s and Ralph Kimball’s, differ at a high level in taking a top-down versus a bottom-up approach, but share a common view of data warehouse characteristics, as defined by Inmon: “a data warehouse is a subject-oriented, non-volatile, integrated, time-variant collection of data in support of management’s decisions”.

The characteristics of the data warehouse matter in the context of the data lake, since the data lake is now touted as the system for breakthrough analytics.

(Figure: comparison of data warehouse characteristics with the data lake)

We often hear that data is the oil of the 21st-century digital economy. The oil we use in our daily lives originates in crude form from the ground and is then transported to refineries. The transported crude oil contains a mix of everything, and it must be refined and transformed before it can be made into the range of products that are useful to consumers.

Extending the oil analogy to data: data from source systems, in its originated form, is moved to a centralized repository, where it is refined using different methods to produce data products useful to the business.

A formal definition of a data lake would be:

A data lake is a centralized, secured, and cataloged repository of data in its source systems’ form, accumulated in a cheap storage system. Using polyglot tools, the data in its source format is refined and transformed to produce data products.

Breaking this down:

  • Centralized — All of the data in one location. Bringing internal organizational data and external data together in one place helps put an end to data silos.
  • Secured — Having all of the data in one location gives organizations great power, and with great power comes great responsibility, so the data lake must be secured with strong authentication, authorization, and audit policies.
  • Cataloged — The data in its originated/source form should, at a minimum, be cataloged. This is important for the discovery of data; without this activity the data lake becomes a data dump. Cataloging involves extracting the schema and versioning changes to the schema, and nothing else (a minimal sketch follows this list).
  • Source system’s form — Data must be kept in the original format in which it originated from the source system. Unlike in transactional systems, data in the data lake is immutable and read-only: all versions of the data are stored; nothing is updated, changed, or discarded (see the landing sketch after this list). In qualifying the input datasets, it’s better to use the term “source system’s form” than native or raw; the terms native/raw/basic don’t indicate whether a transformation has been applied.
  • Accumulated in a cheap storage system — As the data collected now and in the future will be huge, it is important to have dead-cheap storage along with a tiered storage system.
  • Using polyglot tools to refine and transform — Thousands of products are extracted from crude oil after refining, and each product has its own extraction process; that is why the refining industry has huge factories the size of several football fields while the upstream drilling industry doesn’t. Likewise, no single tool is the ideal or ultimate solution for turning data in the data lake into a data product. The choice of tools depends on the expected outcome; the options below run from low to high barrier to use:
* SQL — for business analysis, this is the must-have minimum; the success of a data lake implementation depends on it, ignore it at your peril.
* Python/Java/machine learning libraries — for advanced processing.
* APIs — for automation and operationalization.
  • On-demand and autoscale computing — Avoid long-running infrastructure/clusters; computing power should be acquired, used, and discarded after each experiment.
  • Data product — The outcome of a data lake experiment is a data product. Data products, in turn, provide breakthroughs for organizations. A data product can be:
* A machine learning model.
* A probabilistic, non-deterministic model of the data.
* An aggregated analysis that can be fed to data warehouse systems.
* A data quality assessment of the organizational data.
* A curated dataset that is cleaned, transformed, and corrected for inconsistencies.
* A segmented dataset to be used for target marketing or a study.
* An operational process to generate a machine learning model.
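
Two of the points above deserve a quick sketch. First, cataloging: the sketch below extracts a dataset’s schema and records a new version only when the schema changes. The local JSON catalog store and all paths are hypothetical; a real lake would use a service such as AWS Glue or a Hive metastore.

```python
# A minimal sketch of the cataloging step: extract the schema of an
# arriving dataset and record it only when it changes. The JSON file
# used as the catalog store and the paths are hypothetical.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import pyarrow.parquet as pq

CATALOG = Path("catalog.json")

def catalog_dataset(name: str, parquet_file: str) -> None:
    schema = pq.read_schema(parquet_file)  # extract only, never transform
    fingerprint = hashlib.sha256(schema.serialize().to_pybytes()).hexdigest()

    entries = json.loads(CATALOG.read_text()) if CATALOG.exists() else {}
    versions = entries.setdefault(name, [])

    # Append a new version only when the schema actually changed.
    if not versions or versions[-1]["fingerprint"] != fingerprint:
        versions.append({
            "fingerprint": fingerprint,
            "fields": [f"{f.name}: {f.type}" for f in schema],
            "seen_at": datetime.now(timezone.utc).isoformat(),
        })
        CATALOG.write_text(json.dumps(entries, indent=2))

catalog_dataset("sales", "landing/sales/2020-02-14/part-000.parquet")
```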

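Second, keeping data in the source system’s form: a minimal sketch of an immutable landing zone, assuming a hypothetical S3 bucket and key layout. Every delivery gets a new, timestamped key, so nothing is ever updated or overwritten.

```python
# A minimal sketch of landing data in its source system's form: each
# arrival gets a fresh, dated, write-once key. The bucket name and
# prefix layout are hypothetical.
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def land_raw(source: str, filename: str, payload: bytes) -> str:
    ts = datetime.now(timezone.utc)
    # Partition by source and arrival date; the timestamp in the key
    # keeps every version, so re-deliveries never clobber earlier data.
    key = (f"raw/{source}/ingest_date={ts:%Y-%m-%d}/"
           f"{ts:%H%M%S}-{filename}")
    s3.put_object(Bucket="my-datalake", Key=key, Body=payload)
    return key
```
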
The Don’ts of the data lake

  • Don’t transform/change/correct inconsistencies in source data as it arrives in the data lake; KEEP IT IN ITS SOURCE FORMAT.
  • Don’t do business analytics reporting directly from the data lake; use the compute power to prepare datasets and feed them to your data warehouse system (see the sketch after this list).
  • Don’t build star or snowflake schemas on datasets in the data lake.
  • Don’t postpone security design when designing the data lake; security must be integral to the data lake’s design.
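
As referenced in the second don’t above, here is a minimal sketch of the prepare-and-feed pattern in PySpark; the paths, column names, and the curated area the warehouse loader reads from are all hypothetical.

```python
# A minimal sketch of preparing an aggregate in the lake and handing it
# to the warehouse, rather than pointing BI dashboards at the raw zone.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feed-warehouse").getOrCreate()

daily = (
    spark.read.parquet("s3://my-datalake/raw/sales/")  # source-form data
    .groupBy(F.to_date("sale_ts").alias("sale_date"), "store_id")
    .agg(F.sum("amount").alias("total_sales"))
)

# Write the prepared dataset where the warehouse's loader picks it up.
daily.write.mode("append").parquet("s3://my-datalake/curated/daily_sales/")
```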

Most often, analyzing data in the data lake is like finding a needle in a haystack. Each experiment has a chance of striking gold, and sometimes nothing, so keep the cost of experimentation low.

The data lake is a playground for low-cost experimentation that can result in breakthrough data products (a predictive model, new insights, etc.), and it is a competitive advantage for the business if done right.

Disclaimer: All the opinions expressed are my personal independent thoughts and not to be attributed to my current or previous employers.
