Self Healing Data Pipeline

Murari Ramuka
Google Cloud - Community
Feb 12, 2022
Data quality has always been a major concern in the data engineering process. The statistics below help convey the depth of data quality issues:

  1. On average, it costs about $1 to prevent a duplicate, $10 to correct a duplicate, and $100 to store a duplicate if it is left untreated.
  2. 25–30% of data becomes inaccurate, leading to less effective sales and marketing campaigns.
  3. Businesses lose as much as 20% of revenue due to poor data quality.

Clearly, data quality is a major problem that needs immediate attention.

That brings me to the “Self Healing Data Pipeline”. A self-healing pipeline is a mechanism that can automatically correct erroneous data in a data engineering pipeline based on past patterns. This implementation is a unique combination of data engineering and machine learning.

• A self-healing pipeline is a tool/application/pipeline that can be developed in a cloud environment.

• It is an automated, self-validating quality control process for data pipelines.

• A self-healing data pipeline can increase the accuracy of data by filling in missing or incorrect values. It can also take care of some data abnormalities.
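To make the idea of filling in missing or incorrect values more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the implementation behind this article: the `HISTORY` data, the `heal_price` function and the `tolerance` threshold are all hypothetical, and a simple historical median stands in for the "past patterns".

```python
from statistics import median

# Hypothetical history of already-validated unit prices per product
# (illustrative data only, not taken from a real pipeline).
HISTORY = {
    "widget": [20.1, 19.8, 20.4, 20.0, 19.9],
    "gadget": [5.0, 5.2, 4.9, 5.1, 5.0],
}

def heal_price(product, value, tolerance=0.5):
    """Return (healed_value, was_healed) for one incoming value.

    Assumption: a value is erroneous if it is missing (None) or deviates
    from the historical median by more than `tolerance` (as a fraction
    of that median). Erroneous values are replaced by the median.
    """
    typical = median(HISTORY[product])
    if value is None or abs(value - typical) > tolerance * abs(typical):
        return typical, True
    return value, False

# One normal value, one missing value, one abnormal value.
incoming = [("widget", 20.2), ("widget", None), ("gadget", 500.0)]
healed = [(product, *heal_price(product, value)) for product, value in incoming]

for product, value, was_healed in healed:
    print(product, value, "healed" if was_healed else "unchanged")
```

The normal value passes through untouched, while the missing and the abnormal value are both replaced by the historical median. A real pipeline would likely use a richer model and would log every correction so it can be audited later.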

How it Works:

  • Based on past occurrences of data patterns, the pipeline can automatically correct the data.
  • As part of data quality assurance, the self-healing pipeline trains ML models on a known-good set of data and replaces erroneous data with accurate data, so that no reprocessing is required.
  • It can also be combined with other tooling to improve automated modernisation and data governance.
  • With a self-healing data pipeline, data validity checks and data profiling can be run over the complete dataset, along with frequency analysis for data completeness.
  • It helps improve data accuracy and data lineage.
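The steps above can be sketched end to end in a few lines of Python. This is a toy illustration under assumptions, not the actual implementation: the records, the `VALID_COUNTRIES` rule and the `train`/`heal` helpers are hypothetical, and a most-frequent-value lookup learned from the good records stands in for a real ML model.

```python
from collections import Counter, defaultdict

# Hypothetical validity rule: country must be one of these codes.
VALID_COUNTRIES = {"IN", "US", "DE"}

# Hypothetical incoming records (illustrative data only).
records = [
    {"city": "Pune", "country": "IN"},
    {"city": "Pune", "country": "IN"},
    {"city": "Austin", "country": "US"},
    {"city": "Berlin", "country": "DE"},
    {"city": "Pune", "country": None},    # missing value
    {"city": "Austin", "country": "XX"},  # invalid value
    {"city": "Oslo", "country": None},    # no past pattern to learn from
]

def is_valid(rec):
    """Data validity check for a single record."""
    return rec["country"] in VALID_COUNTRIES

def train(good_rows):
    """Learn the most frequent country per city from the good records.

    This lookup is a stand-in for an ML model trained on good data.
    """
    counts = defaultdict(Counter)
    for rec in good_rows:
        counts[rec["city"]][rec["country"]] += 1
    return {city: c.most_common(1)[0][0] for city, c in counts.items()}

def heal(rows, model):
    """Replace erroneous values with the learned value where one exists."""
    healed_rows = []
    for rec in rows:
        if is_valid(rec) or rec["city"] not in model:
            healed_rows.append(dict(rec))  # valid, or not healable: keep as-is
        else:
            healed_rows.append({**rec, "country": model[rec["city"]]})
    return healed_rows

good = [r for r in records if is_valid(r)]   # the "good set" of data
model = train(good)
healed = heal(records, model)

# Validity counts before and after, plus a frequency profile of the result.
before = sum(is_valid(r) for r in records)
after = sum(is_valid(r) for r in healed)
profile = Counter(r["country"] for r in healed)
print(f"valid rows: {before}/{len(records)} before, {after}/{len(healed)} after")
print(profile)
```

Note that the record for "Oslo" stays unhealed because the good data contains no pattern for it. Leaving such rows untouched (and flagging them) is a deliberate choice in this sketch, since guessing without evidence would reduce accuracy rather than improve it.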

More articles on implementing a self-healing data pipeline are coming soon. Stay tuned.

Disclaimer: This is to inform readers that the views, thoughts, and opinions expressed in the text belong solely to the author, and not necessarily to the author’s employer, organization, committee or other group or individual.

Interested in learning more about how a self-healing data pipeline can be implemented? Reach out to Murari Ramuka.

Murari Ramuka
Google Cloud - Community

Data enthusiast who helps deliver key data-driven outcomes through Cloud Data Platform implementations