One major challenge in implementing data pipelines is testing the pipeline code reliably. The outcome of the code is tightly coupled to the data and the environment, which prevents developers from following test-driven development, catching bugs early with good unit tests, and releasing code via CI/CD with confidence.
One way to overcome the reliability challenge is to use immutable data to run and test the pipeline so that the result of ETL functions can be matched against known outputs.
This requires a good knowledge of the application and of how well the data matches the business requirements. It also requires some setup so that the developer can focus on building the application instead of spending time preparing the environment.
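As a rough illustration of testing against immutable data, here is a minimal sketch in Python. The fixture, the expected result, and the `total_by_customer` function are all hypothetical names made up for this example and are not part of any pipeline described in this post. The point is only that a fixed input lets the output of an ETL function be compared exactly with a known answer.

```python
from collections import defaultdict

# Hypothetical immutable fixture: a tuple of input records that is
# checked in with the code and never changes between test runs.
FIXTURE_ORDERS = (
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 20},
    {"customer": "alice", "amount": 15},
)

# Known output for the fixture, worked out by hand from the business rule.
EXPECTED_TOTALS = {"alice": 45, "bob": 20}


def total_by_customer(orders):
    """Hypothetical ETL transform: sum order amounts per customer."""
    totals = defaultdict(int)
    for order in orders:
        totals[order["customer"]] += order["amount"]
    return dict(totals)


def test_total_by_customer():
    # The input is the same on every run, so an exact comparison is safe.
    assert total_by_customer(FIXTURE_ORDERS) == EXPECTED_TOTALS
```

Because the test depends only on data stored alongside the code, it can run the same way on a laptop and in a CI job, assuming a test runner such as pytest is used to collect it.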
This blog post presents a model of self-contained data pipelines with a CI/CD implementation. …