What is Data Quality?
“Garbage in, garbage out”: who hasn’t heard that in their career? This saying is believed to date from the early days of IT. Fortunately, Data Quality is just as old, having emerged alongside good development practices and software testing.
What is Data Quality? Let’s start with an example. Chances are that at some point you were given a coding exercise like this one (if not, let’s pretend): “Write a function that computes the monthly payment for a loan”. After the function declaration, what are the first statements? Following best practices, they check the input data, for example that the loan amount is a positive number. That’s already data quality!
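As an illustration, here is a minimal sketch of such a function in Python. The function name, the parameters and the choice of the standard fixed-rate amortization formula are assumptions made for this example; the point is only that the input checks come before any computation.

```python
import math


def monthly_payment(principal: float, annual_rate: float, months: int) -> float:
    """Fixed monthly payment for a fully amortizing loan (illustrative sketch)."""
    # Data quality first: reject inputs that cannot describe a real loan.
    if not math.isfinite(principal) or principal <= 0:
        raise ValueError(f"loan amount must be a positive number, got {principal!r}")
    if not math.isfinite(annual_rate) or annual_rate < 0:
        raise ValueError(f"annual rate must be zero or positive, got {annual_rate!r}")
    if months <= 0:
        raise ValueError(f"number of months must be positive, got {months!r}")

    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # Interest-free loan: the principal is simply split evenly.
        return principal / months
    # Standard annuity formula for a fixed-rate loan.
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)


if __name__ == "__main__":
    # 100,000 borrowed at 6% a year over 30 years: roughly 599.55 per month.
    print(round(monthly_payment(100_000, 0.06, 360), 2))
    try:
        monthly_payment(-1, 0.06, 360)
    except ValueError as error:
        print(f"rejected: {error}")
```

Raising an error rather than returning a default value is a design choice for this sketch: a loud failure stops a bad amount from silently turning into a bad payment.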
Is this kind of check still relevant when moving from the exercise to a real-world implementation? What could go wrong when the loan amount is stored in a database where the data type is enforced, and retrieved through an API function that the developer sitting next to you just wrote? Well, a seasoned developer knows that a million things could go wrong. A refactoring of the loan data model could change the way data is retrieved. Data stored in the database could be corrupted. A bug introduced in a later version could make the API send -1 as the loan amount, and so on. So, rather than computing negative monthly loan payments (and sending them out to the bank’s clients), you know that any input data should always be checked for quality.
Data Quality is here to ensure that data is fit for its intended use.
It also means that Data Quality depends on the process at hand, as with any SLA. A loan’s monthly payment is expected to be correct 100% of the time. A marketing company could settle for a prospect address file in which 70% of the addresses are correct. Data Quality expectations vary according to the use of the data.
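To make the idea concrete, the same measurement can pass or fail depending on the expectation attached to the use case. The sketch below is only an illustration: the function name and the way validity is represented (one boolean per record) are assumptions, while the 70% and 100% thresholds come from the examples above.

```python
from typing import Sequence


def meets_expectation(valid_flags: Sequence[bool], required_ratio: float) -> bool:
    """Return True when the share of valid records reaches the required ratio."""
    if not valid_flags:
        # Assumption for this sketch: an empty dataset never meets an expectation.
        return False
    valid_ratio = sum(1 for ok in valid_flags if ok) / len(valid_flags)
    return valid_ratio >= required_ratio


if __name__ == "__main__":
    # Hypothetical file in which 8 records out of 10 are valid.
    checked_records = [True] * 8 + [False] * 2
    print(meets_expectation(checked_records, 0.70))  # marketing use: acceptable
    print(meets_expectation(checked_records, 1.00))  # loan payments: not acceptable
```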
So, if developers have been implementing Data Quality checks in their code for decades, what’s the issue? Well, Big Data brings in a new paradigm compared to relational databases.
At Criteo we’re using Hadoop as our Big Data storage, which is a key component of our infrastructure. We own and operate two Hadoop clusters, with a total storage capacity of more than 450 PB. Daily, dataset growth is ~90 TB, and 300k+ MapReduce and Spark jobs are executed.
Data on Hadoop is used in three domains: operations, decision making and machine learning. Data quality is critical for all three. Poor data quality in operations leads to inadequate customer service and loss of clients. Poor data quality in decision making can lead to unsuitable decisions and, ultimately, to a negative impact on business and investments. As for machine learning, algorithms are only as accurate as the data used to train them. Poor data quality leads to poor performance in our Criteo customers’ campaigns.
At the same time, Hadoop is an especially permissive data storage system. It does not enforce any of the standard DBMS integrity constraints, such as domain, entity integrity, referential integrity or key constraints. In addition, Hadoop is used at Criteo as a vast data lake, accessed by many teams to read, process and write data, with multiple sources ranging from relational databases to streaming pipelines. Bugs can appear unnoticed and generate data quality anomalies that propagate through multiple pipelines. As more than 30 teams work with our Hadoop clusters, not only bugs but also intentional changes made to data by one team can affect other teams downstream. The impact of such changes can go unnoticed for a long time when proper checks and alerting are not in place. Another category of risk lies in the assumptions made about data distribution for machine learning, which can become wrong in the long term.
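When the storage layer does not enforce these constraints, the checks have to live in the pipelines themselves. The sketch below shows what that can look like on a handful of in-memory records; the field names (`loan_id`, `client_id`, `amount`) and the function are hypothetical and only illustrate the four constraint types, not any actual Criteo implementation.

```python
from typing import Any, Iterable, List, Set, Tuple


def check_constraints(
    loans: Iterable[dict], known_client_ids: Set[Any]
) -> List[Tuple[str, Any]]:
    """Return (violated_constraint, loan_id) pairs for the given records."""
    violations = []
    seen_ids = set()
    for row in loans:
        loan_id = row.get("loan_id")
        if loan_id is None:
            # Entity integrity: the identifier must be present.
            violations.append(("entity", loan_id))
        elif loan_id in seen_ids:
            # Key constraint: the identifier must be unique.
            violations.append(("key", loan_id))
        else:
            seen_ids.add(loan_id)

        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount <= 0:
            # Domain constraint: the amount must be a positive number.
            violations.append(("domain", loan_id))

        if row.get("client_id") not in known_client_ids:
            # Referential integrity: the client must exist.
            violations.append(("referential", loan_id))
    return violations


if __name__ == "__main__":
    rows = [
        {"loan_id": 1, "client_id": "c1", "amount": 1000.0},
        {"loan_id": 1, "client_id": "c1", "amount": 500.0},
        {"loan_id": 2, "client_id": "c9", "amount": -1},
        {"loan_id": None, "client_id": "c1", "amount": 10},
    ]
    for violation in check_constraints(rows, known_client_ids={"c1"}):
        print(violation)
```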
So, in order to ensure that data is appropriate for its intended purpose, we implemented systematic checks and advanced monitoring. For example, for data used in decision making:
- Data Quality expectations have been translated into SLAs and materialized in 7000+ automated checks on completeness, consistency and accuracy. These checks run each time partitions are created in our reporting data lake, that is to say, every hour. Two sources of statistics feed the Data Quality checks: Hive default statistics on tables, partitions and columns (number of rows, min/max of numeric values, count of NULLs…), and queries launched after the data is inserted, which compute custom aggregations of metrics at table level (sum, count distinct…). Outlier detection methods are applied to these statistics to compute expected values and highlight anomalies. Additional sets of checks validate metric consistency whenever a given metric is available across different data sources and pipelines.
- The team operating the reporting data lake, which is also the on-call team, monitors the results of these checks on TV boards and gets alerted if needed. This same team can create new checks or amend existing ones, continuously aiming for better SLA coverage while avoiding false positives. False positives waste investigation time in the short term and, in the long term, damage the credibility of the system, ultimately leading to its disuse.
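The two kinds of checks described above, outlier detection on partition statistics and consistency of a metric across sources, can be sketched as follows. The article does not say which outlier detection methods are used; the median and median absolute deviation (MAD) used here are an assumption chosen for illustration, as are the function names, the tolerance values and the fallback band applied when the history has no spread.

```python
import math
from statistics import median
from typing import Sequence, Tuple


def expected_range(history: Sequence[float], tolerance: float = 5.0) -> Tuple[float, float]:
    """Expected interval for a statistic, from the median and MAD of past values."""
    center = median(history)
    mad = median([abs(value - center) for value in history])
    # Assumption: if past values are all identical, allow a 1% band around them.
    spread = mad if mad > 0 else abs(center) * 0.01
    return center - tolerance * spread, center + tolerance * spread


def is_anomalous(value: float, history: Sequence[float], tolerance: float = 5.0) -> bool:
    """Flag a new statistic that falls outside the expected interval."""
    low, high = expected_range(history, tolerance)
    return not (low <= value <= high)


def metrics_consistent(value_a: float, value_b: float, rel_tol: float = 0.001) -> bool:
    """Check that the same metric from two sources agrees within a relative tolerance."""
    return math.isclose(value_a, value_b, rel_tol=rel_tol)


if __name__ == "__main__":
    # Hypothetical row counts of the last seven hourly partitions.
    row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
    print(expected_range(row_counts))         # (950.0, 1050.0)
    print(is_anomalous(1003, row_counts))     # False
    print(is_anomalous(120, row_counts))      # True
    print(metrics_consistent(10000, 10004))   # True
    print(metrics_consistent(10000, 10500))   # False
```

A robust statistic such as the MAD is used in this sketch so that one past anomaly in the history does not widen the expected interval and hide the next one.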
Catching data anomalies when building the reporting data lake is good, but not enough. At that point, little can be done to fix the data. The correct data value is most likely unknown. Even if it is known, regenerating past data partitions with the correct value can be a long and complex process. As a result, the usual solution is to work around the issue, with instructions such as “To compute the June number, use data from table ABC, as data from table XYZ is corrupted”, and their equivalent in SQL queries, which become hard to maintain. Even with a good data documentation system, these workarounds do not scale properly.
A long-term solution requires bad data to be detected and handled at the source. To achieve that, you need to foster communication between data producers and data consumers, and to define a commonly agreed Data Quality SLA. Once it is in place, data producers can start taking action to enforce and monitor Data Quality expectations.
To that end, data lineage tools are powerful enablers. In complex environments (the Criteo reporting data lake has more than 170 data pipeline jobs), they provide a way for data producers and consumers to relate to each other. They are also key to investigating Data Quality anomalies.
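One question lineage helps answer during an investigation is “which datasets are affected downstream of this anomaly?”. The sketch below answers it with a breadth-first traversal of a lineage graph. The graph representation (a dictionary mapping each dataset to its direct consumers) and the dataset names are hypothetical; real lineage tools expose this information through their own interfaces.

```python
from collections import deque
from typing import Dict, List


def downstream(lineage: Dict[str, List[str]], source: str) -> List[str]:
    """List every dataset that directly or indirectly consumes `source`."""
    affected = []
    seen = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for consumer in lineage.get(current, []):
            if consumer not in seen:  # also guards against cycles
                seen.add(consumer)
                affected.append(consumer)
                queue.append(consumer)
    return affected


if __name__ == "__main__":
    # Hypothetical lineage: each dataset maps to the datasets built from it.
    lineage = {
        "raw_clicks": ["clicks_clean"],
        "clicks_clean": ["campaign_stats", "ml_features"],
        "campaign_stats": ["client_report"],
    }
    print(downstream(lineage, "raw_clicks"))
    # ['clicks_clean', 'campaign_stats', 'ml_features', 'client_report']
```

Reversing the edges of the same graph gives the opposite view, from an anomalous dataset back to its candidate sources.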
Enforcing Data Quality at Criteo scale, with one of the biggest Hadoop clusters in Europe, was no easy task. Bad data has so many impacts that we cannot ignore it. Systematically addressing issues that used to be discovered mostly by chance is a necessary step to ensure trust in the data we produce and use. Tools help to detect issues early and minimize their impact, but in the end, fostering a culture of Data Quality across teams is the key success factor.
This article has been written by Lucie Bailly and Lionel Basdevant.