The Biggest Problem in any Data Environment: Data Quality

Seth Goldberg
Published in Charting Ahead
Jun 28, 2017

As a data professional, you come across a lot of obstacles in your path. It could be a lack of proper tools, access constraints, sheer data volume, or even prioritization. But there is one pesky problem that is often mentioned yet rarely acted on: data quality.

Data quality issues come in a variety of forms, ranging from incorrect values (e.g. listing someone’s birthdate as 9/26/2123) to missing data entirely. Although these issues might be a minor annoyance to a software developer building an application, for a data professional they can make or break your analysis or product.
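
To make that concrete, here is a tiny, hypothetical illustration (the records and the age calculation are invented for this post) of how one bad birthdate and one missing value quietly poison a simple analysis:

```python
from datetime import date

# Hypothetical customer records: one future birthdate (like the 9/26/2123
# example above) and one missing value.
customers = [
    {"name": "Ana",  "birthdate": date(1984, 3, 14)},
    {"name": "Ben",  "birthdate": date(2123, 9, 26)},  # defective value
    {"name": "Cara", "birthdate": None},               # missing value
]

def age_in_years(birthdate, today=date(2017, 6, 28)):
    """Rough age in whole years; returns None when the birthdate is absent."""
    if birthdate is None:
        return None
    return (today - birthdate).days // 365

ages = [age_in_years(c["birthdate"]) for c in customers]
print(ages)  # one sensible age, one large negative age, one None

valid = [a for a in ages if a is not None]
print(sum(valid) / len(valid))  # the "average customer age" comes out negative
```

The application that collected these records probably never noticed, but any report built on top of them is now wrong.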

Let’s take a look at this graphic:

Source: https://briankegels.files.wordpress.com/2012/07/boehmlaw.gif

Holy s#!t

Ladies and gentlemen, the graph above might give you traumatic flashbacks to your high school Algebra 2 class, but this steeply rising cost curve (drawn here as y = x², which is strictly quadratic rather than exponential growth) conveys two important messages about data quality:

  1. Data defects that slip into an application do not stay put; they compound as the bad data spreads downstream.
  2. The later in the lifecycle a defect is caught, the more $$$ it costs to fix, and that cost climbs steeply with each phase.
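
To put rough numbers on that shape, here is a purely illustrative calculation using the y = x² curve from the graphic (the phase names are generic, and the multipliers are not empirical cost data):

```python
# Relative cost of fixing the same defect at each phase, if the cost follows
# the y = x^2 shape of the curve above. Illustrative only, not real data.
phases = ["requirements", "design", "coding", "testing", "production"]

for stage, phase in enumerate(phases, start=1):
    print(f"{phase:>12}: {stage ** 2}x the cost of a requirements-phase fix")
```

Under this model, a defect that would have cost one unit of effort to fix during requirements costs twenty-five times that once it reaches production.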

Although it might sound like all is lost, there are actions you can take to stem the effects of these pesky issues. First, get management involved. Management has to be behind any effort to address data quality; without that backing, no fix can succeed.

From a technical standpoint, I would advise you to take a holistic approach employing the following methods:

  1. Catch defects before they make it into production. I’m lumping automated unit and integration testing into this. This is probably the single most important action you can take, since it is the most cost-effective and easiest way to combat data quality issues.
  2. Implement a data quality solution. This seems pretty obvious, but you will want some way to catch and *hopefully* correct any data quality issues, ideally on the fly (see the sketch after this list).
  3. Monitor your progress towards data quality nirvana. Meet with key stakeholders to prioritize fixes, but also to confirm that issues you believe to be resolved actually are. Ideally, quantify the issues so you can measure progress.
  4. Evangelize the importance of data across the organization. Help the development team(s) and your stakeholders understand why the data matters. Once they truly do, fixing your data issues becomes much easier.
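
Here is the minimal sketch promised in point two: a hypothetical check-and-quantify pass over incoming records. The field names and rules are invented for this post rather than taken from any particular data quality tool, and the same pass produces the issue counts that point three asks for:

```python
from datetime import date

def check_record(record, today=None):
    """Return a list of data quality issues found in a single record."""
    today = today or date.today()
    issues = []
    if not record.get("name"):
        issues.append("missing name")
    birthdate = record.get("birthdate")
    if birthdate is None:
        issues.append("missing birthdate")
    elif birthdate > today:
        issues.append("birthdate in the future")
    return issues

def run_quality_pass(records):
    """Split records into clean and quarantined, counting issues by type."""
    counts = {}
    clean, quarantined = [], []
    for record in records:
        issues = check_record(record)
        if issues:
            quarantined.append((record, issues))
            for issue in issues:
                counts[issue] = counts.get(issue, 0) + 1
        else:
            clean.append(record)
    return clean, quarantined, counts

records = [
    {"name": "Ana", "birthdate": date(1984, 3, 14)},
    {"name": "",    "birthdate": date(2123, 9, 26)},
]
clean, quarantined, counts = run_quality_pass(records)
print(counts)  # e.g. {'missing name': 1, 'birthdate in the future': 1}
```

In a real pipeline, the quarantined records would go to a review queue or be corrected on the fly, and the counts would feed the progress metrics you review with your stakeholders.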

I want to jump into bullet one a little bit more, since it is by far the most important. There are many different ways you can prevent data defects from ever making their way into your applications. Getting developers to write better automated unit tests is by far the best way to prevent these issues (and, as the graphic above shows, much cheaper than catching them in quality assurance testing). The more the developers care about the quality of the data, the better the results you will get. There are a variety of tools out there that help accomplish this, but the most popular seem to be the xUnit family of unit testing frameworks.

Unit testing is not the whole story: integration testing is equally important. This type of testing makes sure that, at a high level, the code is doing what it is supposed to do. A popular free database testing framework that can be used to do this is DBFit.

Another avenue of attack is your database layer. Leveraging not-null, primary key, and foreign key constraints can be a simple yet highly effective way of ensuring high-quality data.
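
To make that concrete, here is a minimal sketch using Python’s built-in unittest module (an xUnit-style framework) together with an in-memory SQLite database. The clean_email function and the customer/orders tables are hypothetical stand-ins, and this is not DBFit itself, just the same idea of testing both your transformation logic and your database constraints:

```python
import sqlite3
import unittest

def clean_email(raw):
    """Hypothetical transformation under test: normalize an email address."""
    if raw is None or "@" not in raw:
        return None
    return raw.strip().lower()

class DataQualityTests(unittest.TestCase):
    def setUp(self):
        # In-memory database with not-null, primary key, and foreign key
        # constraints acting as a last line of defence against bad data.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("PRAGMA foreign_keys = ON")
        self.conn.executescript("""
            CREATE TABLE customer (
                id    INTEGER PRIMARY KEY,
                email TEXT NOT NULL
            );
            CREATE TABLE orders (
                id          INTEGER PRIMARY KEY,
                customer_id INTEGER NOT NULL REFERENCES customer(id)
            );
        """)

    def tearDown(self):
        self.conn.close()

    def test_clean_email_normalizes_case_and_whitespace(self):
        self.assertEqual(clean_email("  Ana@Example.COM "), "ana@example.com")

    def test_clean_email_rejects_garbage(self):
        self.assertIsNone(clean_email("not-an-email"))

    def test_not_null_constraint_blocks_missing_email(self):
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute("INSERT INTO customer (id, email) VALUES (1, NULL)")

    def test_foreign_key_blocks_orphaned_order(self):
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute("INSERT INTO orders (id, customer_id) VALUES (1, 999)")

if __name__ == "__main__":
    unittest.main()
```

Run it with python -m unittest: the first two tests guard the transformation logic, and the last two confirm that the database constraints will reject bad data even if a defect slips past the code.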

Although fixing data quality issues can seem like an insurmountable task, there are a variety of techniques you can employ to attack the problem. The key is getting everyone involved and invested in producing accurate, reliable data.
