Build a Data Quality workflow to monitor the Data Lake health status

Davide Romano
Published in MDS-BD
2 min read · Jun 19, 2022

This is a short excerpt from the original article published on the Towards Data Science blog.
Link to the official blog post:
How to monitor Data Lake health status at scale

At Mediaset, the Data Lake is a fundamental tool used daily by everyone who wants to get some company insights or activate data.

By definition, a Data Lake is “a centralized repository to store all your structured and unstructured data at any scale. You can store data natively from the source without having to transform them on the run” (AWS). This allows you to keep a vast amount of raw data that you can activate later with different types of analytics.

The number of Mediaset employees who access the Data Lake is rising, and so is the number of products and the amount of data we ingest and persist. As the user base and the volume of archived data increase, so does the complexity of managing the Data Lake.

This is a critical point. Without Data Quality and Data Governance systems in place, the Data Lake can turn into a Data Swamp: “a data store without organization and precise metadata to make retrieval easy” (Integrate.io).

Data Swamps can form quickly and create problems for data-driven companies that want to implement advanced analytics solutions. If data is closely governed and its health status is checked constantly, the Data Lake has the potential to become an accurate and game-changing business insights tool for the company.

Having full control over the data persisted in the Data Lake, and knowing what to expect from it, becomes more critical every day if we want to prevent the lake from turning into a Data Swamp.

For this purpose, we implemented a Data Quality Workflow to support different business units in the demanding task of data validation, keeping track of data health, drifts, and anomalies. This helps people constantly evaluate both ingested and transformed data, enhancing the trustworthiness of the data and the general quality of data-driven products.
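To make the idea of a data health check more concrete, here is a minimal sketch in plain Python. It is not the workflow described in the full article: the function names, the `user_id` column, the sample batch, and the thresholds are all hypothetical, chosen only to illustrate two common kinds of checks (a missing-value validation and a volume drift check against a baseline).

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    observed: float


def null_rate_check(rows: List[Dict[str, Any]], column: str,
                    max_null_rate: float) -> CheckResult:
    """Pass when the share of missing values in `column` is within the limit."""
    name = f"null_rate[{column}]"
    if not rows:
        # An empty batch is treated as a failure (assumption of this sketch).
        return CheckResult(name, False, 1.0)
    nulls = sum(1 for row in rows if row.get(column) is None)
    rate = nulls / len(rows)
    return CheckResult(name, rate <= max_null_rate, rate)


def volume_drift_check(current_count: int, baseline_count: int,
                       tolerance: float) -> CheckResult:
    """Pass when the row count stays within a relative tolerance of the baseline."""
    if baseline_count <= 0:
        # Without a usable baseline the drift cannot be evaluated.
        return CheckResult("volume_drift", False, float("inf"))
    drift = abs(current_count - baseline_count) / baseline_count
    return CheckResult("volume_drift", drift <= tolerance, drift)


def run_checks(rows: List[Dict[str, Any]], baseline_count: int) -> List[CheckResult]:
    """Run the hypothetical checks on one ingested batch."""
    return [
        null_rate_check(rows, "user_id", max_null_rate=0.05),
        volume_drift_check(len(rows), baseline_count, tolerance=0.2),
    ]


if __name__ == "__main__":
    # Hypothetical batch of ingested records.
    batch = [
        {"user_id": "a1", "minutes_watched": 12},
        {"user_id": None, "minutes_watched": 7},
        {"user_id": "c3", "minutes_watched": 30},
        {"user_id": "d4", "minutes_watched": 5},
    ]
    for result in run_checks(batch, baseline_count=4):
        status = "OK" if result.passed else "ALERT"
        print(f"{status} {result.name}: {result.observed:.2f}")
```

In a real Data Lake, checks like these would typically run on every ingestion or transformation job, with the results stored over time so that drifts and anomalies can be tracked rather than only detected once.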

The Data Quality workflow architecture (Image by the Author)

To be continued…

Read the full article on Towards Data Science, where we present the architecture of the Data Quality Workflow, the critical points that led us to build it, and the entire technology stack involved.
