Data Platform — Validation

By — Shubhanshu Garg (Engineer, Data Platform)


We are currently on our journey towards solving for data availability, with the platform as a key part of that vision. Here we shed some light on how we ensure that the data processed via the platform is reliable, of high quality, and fit for driving business decisions.

Why a Data Validation Framework?

We built our data platform to be versatile: anyone can come to the platform and onboard new sources (MySQL/Mongo) on their own.
Since all of this is self-service, new use cases keep getting added without us being in the loop.

This calls for a key platform capability: a validation framework that is automatically enabled for all new and existing use cases, and is able to report any loss of data in the pipeline.

Let’s use a real world analogy to understand why this data validation is of such importance.

Photo by Vincent Ghilione [Unsplash]

Going through this COVID pandemic, we hope that we are able to develop an effective vaccine for the disease. Once this vaccine is approved, it would have to be sent from the country of origin/production to various other countries.
Now, since the country of origin (just like the source MySQL/Mongo databases) also requires the vaccine, the vaccine would have to be replicated and transported (data ingestion and persistence) to the recipient countries (the data analytics layer/data warehouse).

However, if there is some error in the replication or the transportation process, it would lead to a very scary situation for the citizens of the recipient country as an imperfect vaccine could have deleterious effects.

In the data world, having imperfect data is equally scary. It can cause huge economic loss as well as bad customer experience.

If there is no one ensuring the reliability and validity of warehouse analytics data, marketing, product and business decision making could be seriously crippled.

What Should Data Validation Framework Do?

Data has two main aspects — rows and columns. Hence, our data quality definition should involve both of them. We have defined the following three key aspects that we want the validation framework to address:

  1. Data completeness: All the rows of data present in the source are also present in the destination.
  2. Data correctness: All the columns of each row at the source are captured correctly at the destination.
  3. Easily extensible: The framework should support multiple transactional sources, and should be easy to extend for new sources.

For data completeness, we need a row count validation between source and destination. This is the most basic validation that should be in place.

For data correctness, along with the rows, column values should also be matched. Solving correctness completely would mean very heavy computation, since every column would need to be matched against the source. We can optimise this by identifying a few key columns which, when matched, account for the majority of correctness.

Let’s understand this completeness and correctness using the following example.
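
Here is a hypothetical illustration of such a source/destination pair. Only the ids, the date range and the behaviour for id = 2 come from the example; the names and exact timestamps are made up for illustration:

```
Source (MySQL/Mongo)                       Destination (warehouse)
id | name         | updated_at             id | name   | updated_at
---+--------------+------------------      ---+--------+------------------
 1 | Ramesh       | 2020-06-19 10:00        1 | Ramesh | 2020-06-19 10:00
 2 | Suresh Kumar | 2020-06-22 18:30        2 | Suresh | 2020-06-20 09:15  <- stale
```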

In the example, if we do an id and count comparison for the time range 2020–06–19 to 2020–06–23, both source and destination have the same count (2) and matching ids (1 and 2). So, from a rows perspective, the data is complete.

But ask yourself, “Is the data correct?” If you look at id = 2, the updated_at and name columns are different, as the row was updated later at the source. This is where matching both the id and the updated_at columns provides an additional level of accuracy.

We took the above pointers and came up with a generic and scalable DataPlatform — DataBase Difference (DP-DB Diff) framework.

How To Solve For Data Validation? DP-DB Diff

Before we see how to solve it, let’s have a quick recap of Spark and Spark DataFrames:

Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances.
A Spark DataFrame is basically a distributed collection of rows (Row types) with the same schema.

To solve for data validation, we built a framework using Spark as the computation engine. We chose Spark because our use case requires a comparison on multiple columns for multiple sources in a small time frame. In the absence of distributed computing, it would have taken us more than a day to complete one comparison run.

Coming to the actual validation logic, for each source we maintain and compare the following metadata:

  1. Primary key column: Name of the primary key column for the table/collection
  2. Updated At column: Name of the column which stores the record’s last updated time

We create two Spark DataFrames, one for the source and one for the destination, each containing the primary key and updated-at column values.

When we compare these two DataFrames, we get a diff DataFrame which captures differences in both the primary key and the updated-at column (a minimal sketch follows the list below). This helps in:

  1. Completeness: Validating that all the primary keys match the source ensures that all the rows are being correctly replicated.
  2. Correctness: Validating that the last update time of each primary key matches ensures that no update is missed. In the absence of this, we might have a stale state for a source record in the destination.
  3. Extensible: To handle multiple transactional sources, we have created an abstraction using an interface which can be extended for different sources. Since warehouse data computation and diff computation remain the same for every source, all we need to do to handle a new source is write a new implementation for that source.
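
Below is a minimal sketch of this comparison in PySpark. It is not our actual implementation; the connection details, table name and column names (`orders`, `id`, `updated_at`) are placeholders for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dp-db-diff").getOrCreate()

# Metadata maintained per table/collection (placeholder values).
PRIMARY_KEY = "id"
UPDATED_AT = "updated_at"

# Source DataFrame: only the primary key and last-updated columns, read
# from the transactional source (JDBC for MySQL here; Mongo would go
# through its own connector).
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://source-host:3306/app_db")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "***")
    .load()
    .select(PRIMARY_KEY, UPDATED_AT)
)

# Destination DataFrame: the same two columns from the warehouse copy.
destination_df = spark.table("warehouse.orders").select(PRIMARY_KEY, UPDATED_AT)

# Rows whose (primary key, updated_at) pair exists at the source but not
# in the destination: either the row is missing entirely (completeness)
# or its latest update was not replicated (correctness).
diff_df = source_df.join(
    destination_df,
    on=[PRIMARY_KEY, UPDATED_AT],
    how="left_anti",
)
```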

This is the interface we have and, at present, we have implemented it for two main sources: MySQL and Mongo. If we have a new transactional source in the future, we can easily extend it.
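
As a rough, hypothetical sketch of what such an abstraction could look like (the class and method names below are made up for illustration and are not our actual interface), each source only has to implement how its primary key and updated-at columns are read:

```python
from abc import ABC, abstractmethod
from pyspark.sql import DataFrame, SparkSession


class SourceReader(ABC):
    """Hypothetical abstraction: each transactional source implements only
    how to read the (primary key, updated_at) columns for a given table."""

    @abstractmethod
    def read_keys(self, spark: SparkSession, table: str,
                  primary_key: str, updated_at: str) -> DataFrame:
        ...


class MySQLReader(SourceReader):
    def read_keys(self, spark, table, primary_key, updated_at):
        # Connection options (host, credentials) are placeholders.
        return (spark.read.format("jdbc")
                .option("url", "jdbc:mysql://source-host:3306/app_db")
                .option("dbtable", table)
                .load()
                .select(primary_key, updated_at))


class MongoReader(SourceReader):
    def read_keys(self, spark, table, primary_key, updated_at):
        # Assumes the MongoDB Spark connector is available; the format
        # name differs across connector versions.
        return (spark.read.format("mongodb")
                .option("collection", table)
                .load()
                .select(primary_key, updated_at))
```

The warehouse-side read and the diff computation stay common; only the source reader differs per source.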

Data Validation Consumption

The DP-DB Diff framework provides the following key metrics at each event level (a sketch of how they could be derived follows the list):

  1. Diff Count: The count of differences, which tells us how many id updates were lost.
  2. Diff Percent: The percentage of difference, which gives a more accurate picture of the loss impact, as absolute counts can be misleading.
  3. Sample Ids: Three sample ids whose updates were lost; this is super helpful in debugging and uncovering the reason for the loss.
  4. New Event: A flag that sets recently onboarded events apart from older events, so that the urgency of fixing the difference can be determined.
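
Continuing the earlier sketch, these metrics could be derived from the diff DataFrame roughly as follows (again, an illustration under the same placeholder assumptions, not the exact implementation):

```python
# Continuing from the source_df / diff_df sketch above.
total_count = source_df.count()
diff_count = diff_df.count()

# Diff Percent: loss relative to the total number of source rows, which is
# more meaningful than the absolute count alone.
diff_percent = 100.0 * diff_count / total_count if total_count else 0.0

# Sample Ids: three ids whose updates were lost, useful while debugging.
sample_ids = [row[PRIMARY_KEY]
              for row in diff_df.select(PRIMARY_KEY).limit(3).collect()]

# The New Event flag would come from onboarding metadata for the event,
# not from the diff itself.
```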

We consume the above data points in two forms:

  1. Diff visualisation: All computed diff information is stored in a MySQL table (a rough sketch of this follows below). We use Redash, an open-source tool, to build a visualisation for the last 7 days, which makes the data easily actionable.
  2. Diff mail: While the visualisation helps in identifying sources which have validation issues, we have also set up a daily mailer which provides us with insights so that the data loss can be easily debugged.
Diff Visualisation and Mail
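
For reference, persisting these metrics to the MySQL table that Redash and the mailer read from could look roughly like this; the table name and connection details below are placeholders, not our actual setup:

```python
# One row of diff metrics per event/table, appended to a results table
# that the Redash dashboard and the daily mailer query.
metrics_df = spark.createDataFrame(
    [("orders", diff_count, diff_percent, ",".join(map(str, sample_ids)))],
    ["event", "diff_count", "diff_percent", "sample_ids"],
)

(metrics_df.write.format("jdbc")
 .option("url", "jdbc:mysql://metrics-host:3306/dp_db_diff")
 .option("dbtable", "diff_results")
 .option("user", "writer")
 .option("password", "***")
 .mode("append")
 .save())
```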

Conclusion

Our data validation issues have reduced a lot since we built this framework.

This framework has helped us proactively fix data issues, compared to the reactive approach we followed earlier, where users would come to us after a business delay/impact had occurred because of incomplete/incorrect data.

We now feel closer to our mission of providing correct and complete data to all the stakeholders.

Please share your thoughts about this blog in the comments.

About the author:

Shubhanshu Garg leads Data Engineering. They work on the foundational layer for data systems and are the owners of the data stack.

Sounds like fun?
If you enjoyed this blog post, please clap 👏(as many times as you like) and follow us (UrbanClap Blogger) . Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc).

You can read up more about us on our publications —
https://medium.com/urbanclap-design
https://medium.com/urbanclap-engineering

https://www.urbancompany.com/blog/humans-of-urban-company/

If you are interested in finding out about opportunities, visit us at http://careers.urbanclap.com
