Qualifying Data Quality

Brian Ngo
Published in Acerta

Mar 10, 2020

Written by Jean-Christophe Petkovich, Mahmoud Salem & Sergey Strelnikov

It’s been said — though it’s unclear by whom — that, “Quantity has a quality all its own.” The idea is simple enough to grasp: put enough of something together and the way it’s evaluated changes. The sorites paradox (How many grains can you remove from a heap of sand before it stops being a heap?) is the classical example, but there are innumerable others. Think of the first conscripted armies or the rise of mass production.

In the latter case, the quantity of greatest interest today is data, or, more accurately, information. Data quality is a subject of increasing concern, not only among data scientists but manufacturers as well. As the cost of sensors continues to fall, more industrial processes will be instrumented, and factories will generate more data.

Of course, not all data is created equal, so it’s worth asking (i) what defines data quality, (ii) what the symptoms of bad data are, and (iii) if your data is bad, what can be done to improve it.

What is Data Quality?

While the meaning of the term ‘data quality’ is still contentious, we define data quality at Acerta in terms of reliability, traceability, and completeness. Data reliability is by far the most important. It represents the combination of accuracy and precision: if your data is reliable, you can be confident that it’s telling you the truth. To put it in more concrete terms, you can rely on the fact that when a signal tells you the voltage for a process is five, it is indeed 5V.

Data traceability concerns visibility into the exact source of a piece of information. Detecting the early indicators of future product failures is only useful if you can tie those indicators to the relevant processes on the factory floor. Suppose you find that variation during an upstream finishing operation is correlated with tolerance stack-up issues in transmission assembly. If the data you’ve been collecting doesn’t distinguish between two separate machines performing the same finishing operation, you’ll still have some work to do.

It should be noted, however, that if we’re talking about the source of some data in terms of database vs. data lake vs. data warehouse, Acerta is agnostic. From our perspective, how the data is stored matters much less than the conditions under which it was obtained or its informational content. This brings us to the final component of data quality: completeness. The simple fact is that if you want to predict failures, you need all the relevant information about what happened before and after each failure.

In the context of machine learning, invoking a notion of relevance might seem to risk dredging up the frame problem, but really it’s just a question of having the right domain knowledge. If we don’t know the process map for a dataset we get from a manufacturing line, it will probably take longer to analyze. If we don’t know the procedures for reworking a transmission — for example — our assumptions about what we’re looking at could be incorrect. Of course, having data that’s both reliable and traceable can go a long way toward understanding a process map.

Symptoms of Bad Data

The most important thing to remember in discussions of data quality is that it’s not so much a matter of good vs. bad as it is bad vs. worse. That’s partly tongue-in-cheek, but it’s a helpful reminder that there is no such thing as perfect-quality data. It’s a bit like evolutionary biology, where an organism’s “fitness” isn’t an inherent attribute; fitness has to be defined in relation to a particular purpose.

Nevertheless, there are some signs that are almost universally indicative of poor quality data.

For example, if you calculate the percentage of missing values and find that it’s relatively high, that’s a symptom of bad data. If there are metadata issues (your column names are confusing or just plain bizarre), you might have bad data. Similarly, if the information content seems “crazy”, that’s another symptom of bad data.
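
To make the first of these symptoms concrete, here is a minimal pandas sketch of a missing-value check. The file name, column names, and the 20% threshold are placeholders for illustration, not anything from our own tooling.

```python
import pandas as pd

# Hypothetical sensor log; the file name and threshold are placeholders.
df = pd.read_csv("line_sensor_log.csv")

# Share of missing values per column, worst offenders first.
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
print(missing_pct.head(10))

# Columns with more than 20% of values missing deserve a closer look.
suspect_cols = missing_pct[missing_pct > 20].index.tolist()
print("Columns with a high share of missing values:", suspect_cols)
```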

We’ve seen all these symptoms at Acerta — voltages labelled as braking forces, GPS data suggesting vehicles are travelling three times around the Earth in less than an hour, velocities fluctuating so rapidly that the change in kinetic energy would cause the vehicle to explode. In fact, our data scientists have built tools to check the sanity of client data specifically for cases like these.
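
As a rough illustration of the kind of sanity check described above (not our actual tooling), the sketch below flags physically implausible trips. It assumes a pandas DataFrame with hypothetical ‘timestamp’, ‘odometer_km’, and ‘speed_kph’ columns and hand-picked thresholds.

```python
import pandas as pd

def sanity_check_trip(df: pd.DataFrame) -> list[str]:
    """Flag physically implausible values in a hypothetical trip log.

    Assumed (placeholder) columns: 'timestamp' (datetime),
    'odometer_km' (cumulative distance), 'speed_kph'.
    """
    issues = []

    # Implied average speed: a road vehicle cannot cover several Earth
    # circumferences in under an hour.
    duration_h = (df["timestamp"].max() - df["timestamp"].min()) / pd.Timedelta(hours=1)
    distance_km = df["odometer_km"].max() - df["odometer_km"].min()
    if duration_h > 0 and distance_km / duration_h > 300:
        issues.append(f"implied average speed of {distance_km / duration_h:.0f} km/h")

    # Acceleration implied by successive speed samples; sustained readings
    # beyond a few g point to a data problem, not a driving style.
    dt_s = df["timestamp"].diff().dt.total_seconds()
    accel_ms2 = (df["speed_kph"].diff() / 3.6).abs() / dt_s
    if (accel_ms2 > 30).any():  # roughly 3 g
        issues.append("implausible acceleration spikes between samples")

    return issues
```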

Improving Data Quality

If you’ve made it this far in the post, you might have found the preceding examples of bad data uncomfortably familiar. For those in manufacturing who are worried about the quality of their data, the good news is that there are simple steps you can take to make improvements. Volume is the most obvious, as a good dataset is generally a large one.

However, you should avoid falling into the trap of thinking that all you need to do to improve your data quality is to collect a lot more of it. This goes back to the distinction between data and information. The amount of information in a dataset depends on the amount of variation in the signals; if you have signals that don’t vary a lot, they’re probably not information-bearing. Think of it this way: the volume of data in the Magna Carta and a document with the word ‘buffalo’ repeated 4,478 times is the same, though they differ massively in information content.
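
To sketch what that might look like in practice (the scaling and tolerance below are assumptions, not a rigorous information measure), you can flag near-constant signals before spending time modelling them:

```python
import pandas as pd

def low_information_signals(df: pd.DataFrame, tol: float = 1e-6) -> list[str]:
    """Return numeric columns whose values barely vary.

    A near-constant signal carries almost no information, no matter how
    many rows of it you collect. The tolerance is an arbitrary cut-off.
    """
    numeric = df.select_dtypes("number")
    # Standard deviation scaled by mean magnitude, so the check is not
    # fooled by units (volts vs. millivolts, for example).
    spread = numeric.std() / numeric.abs().mean().replace(0, 1)
    return spread[spread < tol].index.tolist()
```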

A better approach to improving data quality is to start by asking who will be using the data you’re collecting and to what purpose. Once those stakeholders are identified, allow them to guide your data collection. Ideally, you’ll have a “data governor” (or maybe a Speaker for the Data), who can advocate on behalf of data quality.

In one of our best experiences with client data, we received a separate document that explained all the columns, the conditions during data collection, and even some observations from their engineers. In other words, better labelling is one of the best ways to improve data quality. To that end, movement toward OS and middleware standardization in the auto industry — from organizations such as GENIVI — is a welcome development.
