ETL is probably the most time-consuming part of any Data Science project, and the quality of the extracted and processed data is one of the major factors affecting the final results. In fact, real-world data is always messy and inconsistent. Data Validation is a must for enforcing the correctness of the proposed solution and for making sure the underlying data represents the true business scenario.
When performing data validation, the following issues often arise:
- We want to keep track of how much information we lose, for debugging and reporting purposes.
- Sometimes we want to cleanse invalid data instead of filtering it out.
- Part of the validation logic depends on the project requirements and/or model assumptions. These change often, and refactoring may introduce bugs.
In this tutorial we show how to use monads, applicative functors and other functional programming concepts to define the validation logic safely and elegantly using a modular pattern. Each rule is defined individually, and the final logic is built using two types of composition:
- Monad composition. Rules are applied one after the other; if one fails, the next is not applied.
- Applicative composition. All of the rules are applied independently, and the validation results are collected and merged together.
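The two composition styles can be sketched as follows. This is a minimal, hypothetical Python illustration (the article's own implementation and names may differ): a rule maps a record to either a `Valid` result or an `Invalid` one carrying the list of failure reasons.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

@dataclass
class Valid:
    value: dict          # the (possibly cleansed) record

@dataclass
class Invalid:
    errors: List[str]    # reasons why the record failed

Result = Union[Valid, Invalid]
Rule = Callable[[dict], Result]

def monad_compose(*rules: Rule) -> Rule:
    """Fail-fast: if one rule fails, the next rule is not applied."""
    def combined(record: dict) -> Result:
        for rule in rules:
            out = rule(record)
            if isinstance(out, Invalid):
                return out            # short-circuit on the first failure
            record = out.value        # feed the result into the next rule
        return Valid(record)
    return combined

def applicative_compose(*rules: Rule) -> Rule:
    """Independent: every rule runs, and all errors are merged together."""
    def combined(record: dict) -> Result:
        errors: List[str] = []
        for rule in rules:
            out = rule(record)
            if isinstance(out, Invalid):
                errors.extend(out.errors)
        return Invalid(errors) if errors else Valid(record)
    return combined

# Two example rules (hypothetical field names):
def has_age(r: dict) -> Result:
    return Valid(r) if "age" in r else Invalid(["missing age"])

def has_name(r: dict) -> Result:
    return Valid(r) if "name" in r else Invalid(["missing name"])
```

With monad composition, `monad_compose(has_age, has_name)({})` reports only the first error (`missing age`), while `applicative_compose(has_age, has_name)({})` collects both.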
Moreover, data that does not pass the validation tests is not discarded but is moved into a separate pipeline, with all of the necessary metadata attached to it explaining why that particular record was rejected. This allows us to:
- Log all of the specific causes of data loss.
- Easily recover previously invalidated data if the validation rules change.
- Re-use part of the discarded data further down the data pipeline. The whole ETL workflow is aware of what has been discarded before.
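The routing described above can be sketched in a few lines. This is a self-contained, hypothetical example (names and structure are assumptions, not the article's code): each rule returns a list of error strings (empty means the record passed), and failing records are kept in a separate collection together with the reasons for rejection.

```python
from typing import Callable, Dict, List, Tuple

# A rule maps a record to a list of error messages; empty list = pass.
Rule = Callable[[dict], List[str]]

def split_pipeline(records: List[dict],
                   rules: List[Rule]) -> Tuple[List[dict], List[Dict]]:
    """Split records into (valid, discarded); discarded records keep
    their rejection reasons attached instead of being thrown away."""
    valid, discarded = [], []
    for record in records:
        errors = [err for rule in rules for err in rule(record)]
        if errors:
            # keep the original record plus the metadata explaining
            # why it was rejected, for logging and later recovery
            discarded.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, discarded

# Example rule (hypothetical field name):
def has_id(r: dict) -> List[str]:
    return [] if "id" in r else ["missing id"]
```

Because the discarded records travel with their error metadata, later pipeline stages can log the causes of data loss, re-admit records when the rules change, or reuse the parts that are still trustworthy.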
Continue reading the original article at: