TDD (Trust Driven Development) in Data Engineering
A practical guide for building reliable data pipelines using Metorikku and Deequ
In recent years, our data platform has experienced exponential growth. Hundreds of new Data sources, including databases, streaming Kafka topics, data warehouse tables, Elasticsearch indices and Parquet files are constantly extracted, transformed and loaded into various formats.
The initial users of the data layer were data engineers writing pipelines in Scala. This approach didn’t tolerate rapid growth and failed to scale. So, as other generalist developers started to gain interest in the data layer, they were offered to develop and maintain their own pipelines. That’s how we introduced Metorikku, a library based on Spark, which uses simple YAML configuration files to define ETL pipelines. Metorikku offers a variety of supported input and output formats such as Parquet files, Avro, JSON and CSV and various sources such as S3, Kafka, Redshift, Snowflake, JDBC and more…
Metorikku provides a clean and simple way to create a pipeline. But what about maintenance?
How do we verify it runs as planned on each designated duration? That is actually writing the data? That the data covers all our use-cases? That the data is of High Quality?
A Brief on Data Quality
“Data is the fuel for intelligent decision making for both humans and machines. Just like high quality fuel ensures that jet engines run efficiently and reliably in the long run, high quality data fuels effective and reliable decision making”.
Data Quality is somewhat of a controversial topic, we all know it’s important, but it’s just so hard and tedious to fix data quality issues that sometimes we just ignore it.
Well, How would you go about fixing data quality issues?
- Discovery — Usually when a data set has been contaminated with bad quality data, we find out about it in retrospect, which leaves the consumers of data already offended by it.
- Investigation — We need to trace back what might have caused this phenomenon, is it affecting something else? Is it in the pipeline itself? Is it the raw data?
- Fixing — We need to fix the problem at the root cause, is this data source in our control? It might come from other teams? Other products? We should upload a fix to ignore this bug until it is fixed by someone else?
There are just too many variables and this process can result eventually in a temporary patch that would later become permanent.
What have we done up until now?
If any of the input data sources are missing or corrupted, we may experience a chain reaction in all the dependent outputs and the most terrifying thing about it is that we may not even know.
We understood early on that these sort of issues are posing a critical failure point in our architecture, so we started to implement various solutions to better monitor our data pipelines quality
We systematically added simple steps to measure specific KPIs such as output size, null values count, id duplications, etc… These values were written to an Influx time-series DB and on top of that, we created dashboards in Grafana to visualize these values.
This strategy allowed us to:
- Create alerts when experiencing irregular values.
- Track the recurring execution of a pipeline.
- Visualize some of the business metrics we find interesting.
This strategy pretty much covers all our needs.
The problem is that it’s super high-friction and requires you to think of the KPIs, write the instrumentation steps, create the graphs and define alerts.
Data Lake / Data Warehouse Monitoring
Using BI tools such as Redash to visualize actual data saved in our data lake/warehouse also proved to be quite an effective solution. It allowed us to monitor, not just a pipeline, but a complete use case. We can compare different tables to find matches or mismatches, to count daily accumulation of rows and basically to monitor data aggregated throughout the entire lifespan of the data sources.
Here we also face a high-friction process that requires the knowledge and maintenance of BI tools, in our case Redash.
This also proves to be quite expensive since we must have an ongoing DWH cluster or query engine to constantly sample the data.
Another solution, proven to be very valuable, is the use of automatic anomaly detections around the data assets with tools such as MonteCarlo.
The main disadvantage of both of these approaches is that we monitor data that has already been saved in our sources.
We may already be in the midst of the cascading disaster chain and we’re not entirely sure which dependent pipelines had already been affected.
Wouldnt it be great if could preempt these violations before they are messing with our entire data sources?
Ok. So what are you suggesting?
In the past couple of years, we have seen the emergence of several open-source tools in the space of data quality.
As new data lake formats are also on the rise, tools like Hudi, Delta designed to help in versioning the data, and transform the state of certain data sources to a given point in time. With ACID like transactions, These tools also can help in reproducing problems and fixing datasets after they have been contaminated.
Another cool tool is LakeFS, which essentially brings a “git-like” experience to your data, helping you run data pipelines on different “branches” and merge the results if they are valid and correct.
Metorikku + Deequ = ❤️
Deequ is an open-source library built on top of Apache Spark for defining “unit tests for data”.
it enables the user to define rules and validations on top of Spark data frames and being used internally at Amazon for verifying the quality of many large production datasets.
We have integrated Deequ as a part of Metorikku, helping to define data quality validations in the same way you would define an SQL step within your pipeline.
We have also implemented a “Log Level” feature, when set to “Critical”, data quality validations can fail an entire pipeline before writing data to its designated output.
Adding Deequ as a part of our pipeline definition will enable you to monitor and validate that the data is of high quality in transit!
- dataFrameName: moviesWithRatings
JOIN movies ON ratings.movieid = movies.movieid
column: rating- dataFrameName: moviesWithRatings
SELECT CONCAT(userid, ' ', movieid) AS userid_movieid
- dataFrameName: moviesWithRatings
This is huge!! We can prevent bad quality data from being written to our production systems and keep a close eye on how the data is behaving in production.
We decided to treat Data Quality validations just like we treat Unit Tests. We envision data quality as being a key step in any data pipeline which is now being shipped with Data Quality checks and validations in place.
This type of strategy will make each pipeline more robust, resilient, and trustworthy.
Hence the title of this blog is “Trust Driven Development”
We have implemented just a mere fraction of what Deequ is capable of, and we tend to add more operators based on use cases such as anomaly detection, normal distribution, and more (Open source contribution are most welcome)
There are plenty of things to do around Data Quality in the world of Big Data.
We are certain that Metorikku + Deequ can help you by starting small.
By adding data quality validations to your day-to-day data engineering tasks you will substantially reduce ongoing crises and data pipeline maintenance.
These and more will help you build a pristine and crystal clear data “reservoir”, instead of a messy and muddy data lake.
I would like to thank the incredibly talented Assaf Liebstein for co-authoring this blog!
 F.Marelli, Build an automatic data profiling and reporting solution with Amazon EMR, AWS Glue, and Amazon QuickSight (2020), AWS Big Data Blog