Why data quality matters

How to make sure that your big data is also good.

namR
6 min read · May 13, 2019

At nam.R, our job is all about data: finding, cleaning, creating, aggregating, disaggregating and qualifying it. Data is no longer confined to technicians and specialists. We have entered a new era where data is used by every company, at every level and in every department: marketing, management, sales… That’s why our job is not only about delivering good data, but also about making it easy to manipulate and to understand. Starting from day one, we decided to place a special focus on our data quality.

Okay, maybe not from day one, but very early in nam.R’s history.

Let’s look more closely at why data quality matters: because our clients use our data, and because we use it ourselves.

Understanding data

Have you ever tried to open a dataset from the open data world? Well, if you have, you know the struggle of understanding what the dataset is really about: What is its scope? What are the units? What makes a row unique? Why does this tree have a height of 800 km (yes, this is a real example from a real open dataset: twice the altitude of the ISS)? …

Data quality also matters when you train machine learning models. A model may achieve fantastic performance, but if it is trained on poor-quality data, its predictions will be meaningless.

No one wants to spend time trying to understand and correct data: it is painful and difficult. Data’s value lies in the fact that you can understand it and use it easily. Good data is the foundation of good decisions.

Trusting data

There is another aspect of data quality that is extremely important: the confidence you have in your data. Remember that our job is to provide data to our customers, and that our data comes from various sources and in different shapes: public administration, unstructured data (we are really proud of our computer vision and natural language processing teams!), academic and industrial partnerships… There is no reason we should trust every single piece of data in the same way. So we have developed a custom confidence indicator that can tell our clients: “that’s some good dope!”, or “meh, at least it exists”, and everything in between. This confidence indicator is also a token of reliability for our clients. We are data specialists, and we can provide the data you need, but we will always tell you whether you can fully trust our data, or whether you need to be cautious about it.
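To make the idea concrete, here is a minimal sketch of what attaching a confidence score to a data point can look like. The class, the tier thresholds and the roof example are hypothetical illustrations, not nam.R’s actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Attribute:
    """A data point paired with the confidence we have in it."""
    value: str
    confidence: float  # 0.0 = "meh, at least it exists", 1.0 = fully trusted

def confidence_tier(score: float) -> str:
    """Map a raw confidence score to a human-readable tier (thresholds are arbitrary)."""
    if score >= 0.9:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

# A roof material detected by a computer vision model, with low confidence:
roof = Attribute(value="metal", confidence=0.35)
print(confidence_tier(roof.confidence))  # low
```

The point is that the score travels with the value, so a downstream user can always decide how much to rely on it.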

Illustration of our confidence indicator: buildings with tile roofs are detected with high confidence, while the “unusual” building (a swimming pool in this case) is detected as having a metal roof, but with low confidence.

Using data

Furthermore, we also use our own original data, and for many reasons we want it to be as clean, easy to use and easy to understand as possible.

  • We have to be able to trace our data. Let’s say we provide data to one of our clients. That same data is constantly being improved, and the client comes back to us with questions and issues about it. We have to be able to trace our data flow and understand exactly which data was provided and where the issue may come from. Versioning, archives and documentation are important, even when your data is clean. One could even argue that the cleaner the data, the more you need documentation.
  • We have to know our data flow perfectly. When our data sourcers (yes, we have data sourcers, and they are damn useful!) find a new version of an existing dataset, we want to know which of our datasets need to be updated, in which order, and which of our customers may be impacted. Knowing your data flow is key.

Illustration of nam.R’s typical data flow.

  • As a start-up, we are growing quickly. For that reason, we want our collaborators to learn how to handle our database as fast as possible. The simpler the database, the faster people become efficient.
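The “which datasets to update, and in which order” question above is essentially a dependency graph traversal. Here is a small sketch using Python’s standard-library `graphlib`; the dataset names and dependencies are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each dataset maps to the datasets it is built from.
dependencies = {
    "roof_material": {"aerial_images"},
    "solar_potential": {"roof_material", "weather"},
    "client_report": {"solar_potential"},
}

# A topological sort yields a valid update order: every dataset appears
# after everything it depends on.
update_order = list(TopologicalSorter(dependencies).static_order())
print(update_order)
```

With a graph like this, when a source such as `aerial_images` is refreshed, you can walk its descendants to find every derived dataset (and every client deliverable) that needs updating.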

Data quality epic fails

If you are not yet convinced by the importance of data quality, let me illustrate it with two historical examples.

  • In 1999, NASA’s Mars Climate Orbiter burned up in the Martian atmosphere because engineers confused units (metric vs. imperial). A small mistake that cost $125M.
  • More recently, in December 2017, Lactalis (a multinational dairy corporation) detected a risk of salmonella contamination in infant nutrition products. As a consequence, they had to start a legally binding recall procedure. Lactalis was then responsible, with an obligation of result, for: identifying the affected batches, identifying the affected stores and resellers, ordering the withdrawal of all affected products, making sure all the products were actually withdrawn, and informing its clients or making a press release.

Note that if they had kept selling these products, it would have been considered aggravated deception, a criminal offense. Traceability is a component of data quality (What did you sell? To whom? When? Where? …) and it has tremendous importance.

Poor data quality can lead to higher costs in operations, communication, legal affairs…

Defining data quality

I have explained why data quality is important, but not what it really is. There is no single definition of data quality in the literature, but people working on data quality issues tend to agree on what is part of it. Here is a brief summary of the main common dimensions:

  • Relevancy: Is your data useful for what you intend to do?
  • Completeness: What is the coverage? Do you have the appropriate amount of data?
  • Accuracy: Is your data correct? Precise enough? Free of error? Valid?
  • Consistency: Is your data reliable?
  • Understandability: Can you interpret the data? Is it easy to understand?
  • Timeliness: Is your data up-to-date?
  • Reputation: Can you believe your data?

Of course, there are many more aspects to data quality, depending on who you are and what you use data for. After spending a significant amount of time studying the literature, we came to the strong conviction that there is no magic recipe to follow.

Data quality is a tool that you have to forge to answer your own business needs.

namR’s data quality definition

Thus, we decided to build our own definition of data quality based on our needs and those of our clients. The results are summarised in the following table:

Good data quality, then what?

Good data quality is not the end of the story. It is not something you do once and then forget. You have to maintain it, make your collaborators understand its importance, and make them contribute.

You also have to make data quality evolve with your activity: the indicators you follow today may not be the ones you will need to follow tomorrow. And then, maybe, you won’t end up crashing satellites into the Martian atmosphere.

Our mission is to produce & deliver original and actionable data. We use Data and Artificial Intelligence to support the ecological transition.