Are You Sure You Have Good Data?

Best practices for detecting bad data before it spreads

Brad Caffey
HomeAway Tech Blog
8 min read · Apr 3, 2019


Photo by Mika Baumeister on Unsplash / help by Dinosoft Labs from the Noun Project

In an age when HomeAway processes petabytes of data every day, data quality is critical to making the right decisions with that data. But how do you know whether your data is good? If polluted data is introduced into an upstream data source, downstream data sources will be polluted as well unless that bad data is detected. So how can you be sure your data is good?

In a former role, I maintained a database that generated a metric affecting the annual bonuses of over 100 people. I had no system of truth to validate my metric against, because my database was the system of truth. In this post, I will share the practices I developed in that role to ensure I had 100% accurate data for my VP and his organization. I will also share how HomeAway is integrating these practices into our own data ecosystem.

Automated validations

To ensure you have good data, it’s critical that you have systematic validations checking for bad data. That is why we have created a validation engine using Java and Spark for detecting bad data. This validation engine enables us to run a number of standardized validations in our extract-transform-load (ETL) jobs every day. The engine was also built to enable us to deploy new standardized validations quickly whenever a need arises.
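To make that concrete, here is a minimal sketch, not HomeAway’s actual code, of how a standardized validation could be modeled in Java and Spark; the Validation and ValidationResult names are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical sketch of a standardized validation contract. Each concrete
// check (dataset comparison, snapshot comparison, and so on) would implement
// this interface so the engine can run them uniformly inside an ETL job.
public interface Validation {

    /** Outcome of a single validation run. */
    final class ValidationResult {
        public final boolean passed;
        public final String message;

        public ValidationResult(boolean passed, String message) {
            this.passed = passed;
            this.message = message;
        }
    }

    /** Human-readable name used in alerts. */
    String name();

    /** Inspect the freshly loaded data and report whether it looks healthy. */
    ValidationResult validate(Dataset<Row> freshlyLoadedData);
}

Deploying a new standardized validation then amounts to adding another implementation of this contract.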

Run your validations in your ETL

While it’s great to have automated validations, if you don’t execute those validations during your ETL then you risk allowing polluted data to enter your ecosystem. This is why it’s critical to embed your validation checks in your ETL so that your validations stop your ETL immediately whenever bad data is detected. Our validation engine enables HomeAway to define and execute validations in our ETL so that we are alerted as soon as bad data is detected.
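Continuing the sketch above (and reusing its hypothetical Validation interface), an ETL job might run every configured check right after its load step and abort on the first failure; the EtlValidationRunner name and the exception-based abort are illustrative, not the engine’s real API.

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Illustrative fail-fast wrapper: run every configured validation right after
// the load step and abort the ETL (by throwing) as soon as one of them fails.
public final class EtlValidationRunner {

    public static void runOrAbort(Dataset<Row> loadedData, List<Validation> validations) {
        for (Validation validation : validations) {
            Validation.ValidationResult result = validation.validate(loadedData);
            if (!result.passed) {
                // Failing the job here keeps polluted data from being published downstream.
                throw new IllegalStateException(
                    "Validation '" + validation.name() + "' failed: " + result.message);
            }
        }
    }
}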

Have the right validations

There are many different ways to validate data. All validations are not created equal, so it’s critical that you have the right ones to help you detect bad data. I’ll discuss five critical validations used in our validation engine.

Right Validation #1: Dataset Comparison

A common way to validate data is to compare the results of an ETL directly with its source by re-running the same query that was used to pull the data. We call this kind of check Dataset Comparison. While it is useful in certain cases, two fundamental problems make it a poor general-purpose validation.

1) If you change your ETL SQL then you must remember to update your validation SQL as well. Otherwise, you will get false alerts about your data with this comparison.

2) Dataset Comparison is not good at capturing garbage-in garbage-out (GIGO) data issues. Let’s take the extreme scenario of all the data in your source table being deleted. With this type of comparison, your source query will return zero rows and your results query will also return zero rows. That comparison will result in a match and therefore no alert will be raised even though you just pulled data from a completely empty table.

Dataset Comparison is best used when you have a long data pipeline and you want to ensure that all the transformations between the beginning and end of the pipeline haven’t altered the data.
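As a rough illustration only, with made-up table names and queries, a Dataset Comparison in Spark could re-run the source query and diff it against the ETL output:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical dataset comparison: re-run the source query and check that the
// ETL output contains exactly the same rows. Note the maintenance burden from
// problem #1 above: the query below must be kept in sync with the ETL's own SQL,
// and both sides are assumed to share the same schema.
public final class DatasetComparisonExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("dataset-comparison-example")
            .getOrCreate();

        Dataset<Row> source = spark.sql(
            "SELECT booking_id, unit_id, gross_booking FROM source_db.bookings");
        Dataset<Row> etlOutput = spark.read().table("warehouse.bookings");

        // Rows present on one side but not the other indicate a mismatch.
        long missingFromOutput = source.except(etlOutput).count();
        long unexpectedInOutput = etlOutput.except(source).count();

        if (missingFromOutput > 0 || unexpectedInOutput > 0) {
            throw new IllegalStateException(
                "Dataset comparison failed: " + missingFromOutput + " missing rows, "
                + unexpectedInOutput + " unexpected rows.");
        }
        spark.stop();
    }
}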

Right Validation #2: Snapshot Comparison

In order to catch GIGO data issues, a better validation to perform is what we call Snapshot Comparison. This comparison stores an aggregated snapshot of your data every time your ETL runs and then compares it with the snapshot taken the previous day (or the previous week for weekly loads). (It’s important to note that the snapshots persist even when your data is completely overwritten with fresh data.) When you compare the two snapshots at a date grain, the data should closely match.

Here’s an example of how a snapshot comparison would look when comparing a snapshot taken on 6/6 with a snapshot taken on 6/7. Both snapshots capture the last five days of bookings, which means only four days of bookings are in common between the two snapshots. Bookings data does experience some slight variations as you can see below, so in this case, the deltas would be considered acceptable.

Example of how snapshot comparison works

If the source table had had all of its data deleted, Snapshot Comparison would notice that today’s snapshot had no data but yesterday’s snapshot did and therefore would raise an alert for inspection.
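Here is a minimal sketch of how a Snapshot Comparison might be implemented, assuming a hypothetical audit.booking_snapshots table that holds each day’s snapshot and an invented 10% tolerance:

import static org.apache.spark.sql.functions.abs;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical snapshot comparison: aggregate today's load at a date grain,
// compare it with the snapshot persisted by yesterday's run, then persist
// today's snapshot for tomorrow's run to use.
public final class SnapshotComparisonExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("snapshot-comparison-example")
            .getOrCreate();

        String today = "2019-06-07";      // snapshot being taken now (illustrative)
        String yesterday = "2019-06-06";  // snapshot taken by the previous run

        // Today's snapshot: gross bookings aggregated per booking date.
        Dataset<Row> todaySnapshot = spark.read().table("warehouse.bookings")
            .groupBy("booking_date")
            .agg(sum("gross_booking").alias("gross_booking"))
            .withColumn("snapshot_date", lit(today));

        // Snapshots persist in an audit table, so yesterday's numbers survive even
        // if warehouse.bookings has since been completely overwritten.
        Dataset<Row> yesterdaySnapshot = spark.read().table("audit.booking_snapshots")
            .filter(col("snapshot_date").equalTo(yesterday))
            .drop("snapshot_date");

        // Compare the two snapshots on the booking dates they have in common.
        Dataset<Row> deltas = todaySnapshot.drop("snapshot_date").alias("t")
            .join(yesterdaySnapshot.alias("y"), "booking_date")
            .withColumn("pct_delta",
                abs(col("t.gross_booking").minus(col("y.gross_booking")))
                    .divide(col("y.gross_booking")));

        // Raise an alert when any shared date moved by more than the tolerance (10% here).
        long suspiciousDates = deltas.filter(col("pct_delta").gt(0.10)).count();
        if (suspiciousDates > 0) {
            throw new IllegalStateException(
                "Snapshot comparison flagged " + suspiciousDates + " booking dates for review.");
        }

        // Keep today's snapshot so tomorrow's run has something to compare against.
        todaySnapshot.write().mode("append").saveAsTable("audit.booking_snapshots");
        spark.stop();
    }
}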

Right Validation #3: Trend Comparison

But what if your data source is updated incrementally? Snapshot Comparison will not work with incrementally updated datasets because the historical data will always match perfectly. You need a different comparison, one we call Trend Comparison. This comparison calculates a user-specified trend on previously loaded data and checks the newly loaded data against it. If the freshly loaded data falls outside that trend, an alert is raised for review.

In the example below, Trend Comparison calculates the row counts of the five partitions preceding the partition currently being loaded. The mean of those row counts is calculated and then compared with the current partition. If the new partition falls outside a supplied number of standard deviations above and/or below that mean, an alert is raised.

In this example we used row counts, but our validation engine is flexible enough to calculate a trend on any measure(s) with any standard aggregate function.
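A rough sketch of the row-count version of this check, assuming a hypothetical warehouse.bookings table partitioned by load_date; the three-standard-deviation tolerance stands in for the user-supplied limit:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.stddev;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical trend comparison on row counts: compare the partition being
// loaded against the mean of the five preceding partitions, allowing a
// user-supplied number of standard deviations of drift.
public final class TrendComparisonExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("trend-comparison-example")
            .getOrCreate();

        String currentPartition = "2019-06-07"; // partition currently being loaded
        double allowedStdDevs = 3.0;            // user-supplied tolerance

        Dataset<Row> bookings = spark.read().table("warehouse.bookings");

        // Mean and standard deviation of row counts across the five preceding partitions.
        Row stats = bookings
            .filter(col("load_date").lt(currentPartition))
            .groupBy("load_date")
            .count()
            .orderBy(col("load_date").desc())
            .limit(5)
            .agg(avg("count").alias("mean_rows"), stddev("count").alias("stddev_rows"))
            .first();
        double mean = stats.getDouble(0);
        double stdDev = stats.getDouble(1);

        long currentCount = bookings
            .filter(col("load_date").equalTo(currentPartition))
            .count();

        // Alert when the new partition's row count falls outside the allowed band.
        if (currentCount < mean - allowedStdDevs * stdDev
                || currentCount > mean + allowedStdDevs * stdDev) {
            throw new IllegalStateException(
                "Trend comparison: partition " + currentPartition + " has " + currentCount
                + " rows, outside the trend of the preceding five partitions.");
        }
        spark.stop();
    }
}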

The downside to Trend Comparison is that sometimes there are valid reasons for newly loaded data to fall outside the trend and therefore a false alert is raised. In a world where data quality is paramount, false alerts are a small price you gladly pay for high quality data.

Right Validation #4: Unique Key Validation

One validation that is crucial in a cloud-based object store world is checking for duplicates in your dataset. In a relational database like SQL Server, simply adding a primary key to your table is all you need to do to guarantee there are no duplicates. Unfortunately, you can’t do the same when writing data to a file in the cloud, so you have to check programmatically and verify uniqueness during your ETL run. Our validation engine includes a validation called Unique Key Validation that saves data engineers the time and effort of writing the same duplicate-checking logic over and over.
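A minimal sketch of such a uniqueness check, assuming a hypothetical bookings table whose natural key is booking_id:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical unique-key check: group by the declared key column and fail
// the run if any key value appears more than once.
public final class UniqueKeyValidationExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("unique-key-validation-example")
            .getOrCreate();

        Dataset<Row> bookings = spark.read().table("warehouse.bookings");

        // booking_id is assumed to be the natural key of this table.
        long duplicateKeys = bookings
            .groupBy("booking_id")
            .agg(count(lit(1)).alias("cnt"))
            .filter(col("cnt").gt(1))
            .count();

        if (duplicateKeys > 0) {
            throw new IllegalStateException(
                "Unique key validation failed: " + duplicateKeys
                + " booking_id values are duplicated.");
        }
        spark.stop();
    }
}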

Right Validation #5: API Validation

The final validation that is crucial for data quality is API Validation. This validation is similar to Dataset Comparison, except it validates the results of an API against a source dataset to ensure that data flowing through the pipeline is not getting altered.

First, you pass to the validation a dataset that you want to use for comparison with the API results. This dataset should have grain values and measures that correspond to the API results.

Next, you pass the API URL with placeholder(s) for grain values to be inserted so that a complete API call can be generated.

https://internal.api.blah.com/api-call-to-validate/v1/<|unit_id|>

Our validation engine retrieves the first 5000 rows from the comparison dataset and plugs the grain value (in this case, unit_id) into the API call. The API returns JSON results for each individual call.

The JSON results are parsed in Spark with a user-supplied SQL statement that converts the JSON fields to names that match the comparison dataset.

SELECT unit AS unit_id, booking AS gross_booking FROM dataset

The validation engine then compares the API dataset with the comparison dataset for discrepancies.
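Putting the steps together, an API Validation along these lines might look like the following sketch. The warehouse.unit_bookings table, the field types, and the HTTP plumbing are assumptions; only the URL template and the user-supplied SQL come from the example above.

import static org.apache.spark.sql.functions.col;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical API validation: call the API once per grain value from the
// comparison dataset, parse the JSON responses with the user-supplied SQL,
// and diff the parsed results against the comparison dataset.
public final class ApiValidationExample {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("api-validation-example")
            .getOrCreate();

        // Comparison dataset at the same grain (unit_id) and measure as the API,
        // normalized to simple types so it can be diffed against the parsed JSON.
        Dataset<Row> comparison = spark.read().table("warehouse.unit_bookings")
            .select(col("unit_id").cast("string").alias("unit_id"),
                    col("gross_booking").cast("double").alias("gross_booking"))
            .limit(5000);

        String urlTemplate = "https://internal.api.blah.com/api-call-to-validate/v1/<|unit_id|>";
        HttpClient client = HttpClient.newHttpClient();
        List<String> jsonResponses = new ArrayList<>();

        // One API call per grain value, substituting unit_id into the placeholder.
        for (Row row : comparison.select("unit_id").collectAsList()) {
            String url = urlTemplate.replace("<|unit_id|>", row.getString(0));
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            jsonResponses.add(client.send(request, HttpResponse.BodyHandlers.ofString()).body());
        }

        // Parse the JSON and rename its fields with the user-supplied SQL statement.
        spark.read().json(spark.createDataset(jsonResponses, Encoders.STRING()))
            .createOrReplaceTempView("dataset");
        Dataset<Row> apiDataset = spark.sql(
                "SELECT unit AS unit_id, booking AS gross_booking FROM dataset")
            .select(col("unit_id").cast("string").alias("unit_id"),
                    col("gross_booking").cast("double").alias("gross_booking"));

        // Any row present on only one side is a discrepancy worth investigating.
        long discrepancies = apiDataset.except(comparison).count()
            + comparison.except(apiDataset).count();
        if (discrepancies > 0) {
            throw new IllegalStateException(
                "API validation found " + discrepancies + " mismatched rows.");
        }
        spark.stop();
    }
}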

Validations must include detail to research issues when they arise

Validations work best at catching issues when your data is aggregated at a high grain. For example, comparing row counts from day to day is a good way to spot wild fluctuations in your data. However, when a validation finds an issue at an aggregate level, it’s also helpful to have detail data supporting that aggregate view so you can diagnose what happened. Whenever we run a Snapshot Comparison, we actually run two at different levels of detail: one at a high grain to catch data issues at an aggregate level, and a second at a low grain with detail so that we have visibility into what the data looked like the day before (which has now been overwritten by today’s load).

Our validation engine in action

Last summer, HomeAway began integrating our validation engine into its data pipelines. The chart below shows a breakdown of the number of validations integrated into one department’s ETL.

As a result of applying the principles in this post, our validation engine has helped HomeAway catch many potential data quality issues. In just one department, 60 variations of the above validations ran throughout the ETL every day. Out of the 3600 validations executed over the first two months (60 days * 60 validations), the validation engine alerted the team around 100 times about potential data quality issues, which they were able to correct on the spot before resuming their ETL.

Conclusion

Bad data happens, so it’s important to have an ecosystem of checks and validations that alerts you immediately whenever bad data arrives. It’s also important to have the right checks and validations to inspect your data with, so bad data doesn’t slip by. If the data engineers generating your data follow the practices I’ve outlined here, they will greatly reduce the amount of bad data entering your ecosystem.
