Why I Built an Open-Source Tool for Big Data Testing and Quality Control

Tim Gent · Published in The Startup · 4 min read · Aug 11, 2020

I’ve developed an open-source data testing and quality tool called data-flare. It aims to help data engineers and data scientists assure the quality of large datasets using Spark. In this post I’ll share why I wrote this tool, why the existing tools weren’t enough, and how it may be helpful to you.

Who spends their evenings writing a data quality tool?

In every data-driven organisation, we must recognise that without confidence in the quality of our data, that data is useless. Despite that, there are relatively few tools available to help us keep our data quality high.

What I was looking for was a tool that:

  • Helped me write high-performance checks on the key properties of my data, like the size of my datasets, the percentage of rows that comply with a condition, or the distinct values in my columns
  • Helped me track those key properties over time, so that I can see how my datasets are evolving, and spot problem areas easily
  • Enabled me to write more complex checks for facets of my data that couldn’t easily be captured in a simple property, and let me compare different datasets
  • Would scale to huge volumes of data

The tools I found were more limited, constraining me to simple checks defined in YAML or JSON, or only letting me check simple properties of a single dataset. I wrote data-flare to fill these gaps and provide a one-stop shop for data quality needs.

Show me the code

data-flare is a Scala library built on top of Spark. That means you will need to write some Scala, but I’ve tried to keep the interface simple enough that even a non-Scala developer can pick it up quickly.

Let’s look at a simple example. Imagine we have a dataset containing orders, with the following attributes:

  • CustomerId
  • OrderId
  • ItemId
  • OrderType
  • OrderValue

We can represent this in a Dataset[Order] in Spark, with our order being:

case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)
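For example, with a SparkSession in scope we can build a small orders dataset like this (the sample rows are purely illustrative):

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder()
  .appName("flare-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A tiny illustrative Dataset[Order]
val orders: Dataset[Order] = Seq(
  Order("customer-1", "order-1", "item-1", "Sale", 100),
  Order("customer-1", "order-2", "item-2", "Refund", -50),
  Order("customer-2", "order-3", "item-3", "Sale", 250)
).toDS()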

Checks on a single dataset

We want to check that our orders are all in order, including checking:

  • orderType is “Sale” at least 90% of the time
  • Orders with an orderType of “Refund” have orderValues less than 0
  • There are 20 different items that we sell, and we expect orders for each of those
  • We have at least 100 orders

We can do this as follows (here orders represents our Dataset[Order]):
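A sketch of these checks is below. Note that only ChecksSuite, singleDsChecks and SingleMetricChecks are named in this post; the package paths, check constructors and threshold types are assumptions on my part, so consult the data-flare docs for the exact API.

import com.github.timgent.dataflare.checkssuite.{ChecksSuite, DescribedDs}
import com.github.timgent.dataflare.checks.metrics.SingleMetricCheck
import com.github.timgent.dataflare.metrics.ComplianceFn
import com.github.timgent.dataflare.thresholds.AbsoluteThreshold
import org.apache.spark.sql.functions.col

val describedOrders = DescribedDs(orders, "orders")

val checksSuite = ChecksSuite(
  "order checks",
  singleDsChecks = Map(
    describedOrders -> Seq(
      // orderType is "Sale" at least 90% of the time
      SingleMetricCheck.complianceCheck(
        AbsoluteThreshold(Some(0.9), None),
        ComplianceFn(col("orderType") === "Sale", "at least 90% sales")
      ),
      // every "Refund" order has an orderValue below 0
      SingleMetricCheck.complianceCheck(
        AbsoluteThreshold(Some(1.0), None),
        ComplianceFn(col("orderType") =!= "Refund" || col("orderValue") < 0, "refunds are negative")
      ),
      // we sell 20 distinct items and expect orders for each of them
      SingleMetricCheck.distinctValuesCheck(
        AbsoluteThreshold(Some(20L), Some(20L)),
        onColumns = List("itemId")
      ),
      // we have at least 100 orders
      SingleMetricCheck.sizeCheck(AbsoluteThreshold(Some(100L), None))
    )
  )
)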

As you can see from this code, everything starts with a ChecksSuite. You then pass in all of the checks that operate on a single dataset using the singleDsChecks argument. We’ve been able to do all of these checks using SingleMetricChecks, which are efficient and perform all the checks in a single pass over the dataset.

What if we wanted to do something that we couldn’t easily express with a metric check? Let’s say we wanted to check that no customer had more than 5 orders with an orderType of “Flash Sale”. We could express that with an Arbitrary Check like so:
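A sketch of that check follows. I’m assuming an ArbSingleDsCheck that takes a description plus a function from the dataset to a raw result, which matches the shape described here but may not be the library’s exact signature:

import com.github.timgent.dataflare.checks.{ArbSingleDsCheck, CheckStatus, RawCheckResult}

val flashSaleCheck = ArbSingleDsCheck("no customer has more than 5 flash sale orders") { ds =>
  // Count customers with more than 5 "Flash Sale" orders
  val offendingCustomers = ds.as[Order]
    .filter(_.orderType == "Flash Sale")
    .groupByKey(_.customerId)
    .count()
    .filter(_._2 > 5)
    .count()
  if (offendingCustomers == 0)
    RawCheckResult(CheckStatus.Success, "No customer had more than 5 flash sale orders")
  else
    RawCheckResult(CheckStatus.Error, s"$offendingCustomers customer(s) had more than 5 flash sale orders")
}

This check can then be passed to the ChecksSuite alongside the metric checks in the singleDsChecks map.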

The ability to define arbitrary checks in this way gives you the power to define any check you want. They won’t be as efficient as the metric-based checks, but the flexibility can make it a worthwhile trade-off.

Checks on a pair of datasets

Let’s imagine we have a machine learning algorithm that predicts which item each customer will order next, returning another Dataset[Order] of predicted orders.

We may want to compare metrics on our predicted orders with metrics on our original orders. Let’s say that we expect to have an entry in our predicted orders for every customer that has had a previous order. We could check this using Flare as follows:
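A sketch is below, with the same caveat: only dualDsChecks and MetricComparator are named in this post, so the DualMetricCheck and metric descriptor names are my assumptions, and predictedOrders is assumed to be the Dataset[Order] returned by the algorithm:

import com.github.timgent.dataflare.checks.metrics.DualMetricCheck
import com.github.timgent.dataflare.checkssuite.DescribedDsPair
import com.github.timgent.dataflare.metrics.{MetricComparator, MetricDescriptor}

val describedPredictions = DescribedDs(predictedOrders, "predictedOrders")

// Compare the distinct customerId counts of the two datasets for equality
val customerCoverageCheck = DualMetricCheck(
  MetricDescriptor.CountDistinctValuesMetric(onColumns = List("customerId")),
  MetricDescriptor.CountDistinctValuesMetric(onColumns = List("customerId")),
  "every previous customer has a predicted order",
  MetricComparator.metricsAreEqual
)

val predictionChecksSuite = ChecksSuite(
  "prediction checks",
  dualDsChecks = Map(
    DescribedDsPair(describedOrders, describedPredictions) -> Seq(customerCoverageCheck)
  )
)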

We can pass dualDsChecks to a ChecksSuite. Here we describe the pair of datasets we want to compare, the metric we want to calculate for each of them, and a MetricComparator, which describes how those metrics should be compared. In this case we want the number of distinct customerIds in each dataset to be equal.

What happens when you run your checks?

When you run your checks, all metrics are calculated in a single pass over each dataset, and the check results are calculated and returned. You can then decide how to handle those results yourself. For example, if one of your checks returns an error you could fail the Spark job, or send a failure notification.
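For instance, something like the following. I’m assuming here that run takes a timestamp and returns a Future of a result with an overall status; those names are my reading of the library rather than anything stated above, and if run is synchronous in your version you can drop the Await:

import java.time.Instant
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Run all checks in the suite, stamped with the current time
val result = Await.result(checksSuite.run(Instant.now), 10.minutes)

if (result.overallStatus == CheckSuiteStatus.Success)
  println("All order checks passed")
else
  // Fail the Spark job; a notification hook could go here instead
  throw new RuntimeException(s"Data quality checks failed: $result")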

What else can you do?

  • Store your metrics and check results by passing a metricsPersister and qcResultsRepository to your ChecksSuite (Elasticsearch is supported out of the box, and it’s extensible to support any data store; see the sketch after this list)
  • Graph metrics over time in Kibana so you can spot trends
  • Write arbitrary checks for pairs of datasets
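
As a purely hypothetical sketch of the first point: the post only names metricsPersister and qcResultsRepository, so the Elasticsearch class names and constructors below are assumptions, and the real ones live in the data-flare docs:

import com.github.timgent.dataflare.repository.{ElasticSearchMetricsPersister, ElasticSearchQcResultsRepository}

// Hypothetical constructors - host list plus target index name
val metricsPersister = ElasticSearchMetricsPersister(List("http://localhost:9200"), "order_metrics")
val qcResultsRepository = ElasticSearchQcResultsRepository(List("http://localhost:9200"), "order_qc_results")

val monitoredSuite = ChecksSuite(
  "order checks",
  singleDsChecks = Map(describedOrders -> Seq(flashSaleCheck)),
  metricsPersister = metricsPersister,
  qcResultsRepository = qcResultsRepository
)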

For more information check out the documentation and the code!
