Bad data is expensive
1. Organizations believe poor data quality to be responsible for an average of $15 million per year in losses - Gartner research
2. 60% of data scientists admit that they spend most of their time cleaning and organizing data - CrowdFlower
3. It costs 10 times as much to complete a unit of work when the input data are defective as it does when they are perfect - the Rule of Ten by Thomas C. Redman
Why is bad data so costly? It is hard to detect, difficult to fix, and sometimes impossible to trace. It lies dormant and reveals itself only when an incident occurs. Data engineers, data analysts, managers, and decision-makers all use the same data at some point along the pipeline. Every time data moves from one place to another, it carries the hidden cost of correcting and fitting it to a specific use.
In an article in Harvard Business Review, Thomas C. Redman describes a model he calls “The Hidden Data Factory”. When Department B requires data from Department A, it spends extra time correcting and adapting the data for its own use. This extra step can introduce other errors that are invisible to Department B but noticeable to others.
Data-driven decision-making processes are extremely sensitive to bad data. Yet once in a while, we stack layers of erroneous information onto our data without realizing how much it can cost us in the long run.
Bad data is the root of all evil
Bad data not only costs us money, it also consumes human resources. It takes up space in our data centers and blends in with good data, corrupting the output of our data processing systems. In our business, bad data is the root of all evil.
How quickly can we react?
On an April day, the audience numbers for all campaigns were inflated in the reporting tools. The incident was acknowledged as a data-related issue. This time the impact was not financial, but the user experience was seriously disrupted: the abnormal user counts were visible in one of Criteo’s advertising management platforms.
The flawed data were identified within two hours on the same day. This data fed many reporting pipelines, so the whole data flow was affected. The routine kicked in: an emergency ticket was created, team leaders were pinged, and owners were asked to take responsibility.
Nothing suspicious was spotted in the systems that produced the initial data. We knew we would eventually have to understand the root cause, but to react in time, a quick fix was proposed. The team that manages the data agreed to replace the inaccurate data with data from the adjacent hours. We then had to backfill the rest of the pipelines in strict order. Backfilling means recreating historical data to correct a faulty set of data.
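To make the quick fix more concrete, here is a minimal Python sketch of the two steps described above: repairing a flagged hour from its adjacent hours, then deriving a backfill order for the downstream pipelines. It is an illustration under assumptions, not the tooling we actually ran: the function names, the choice of averaging the two nearest healthy hours, and the pipeline names are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from statistics import mean


def refill_from_adjacent_hours(hourly_counts, bad_hours):
    """Return a copy of hourly_counts where each flagged hour is replaced by
    the mean of its nearest unflagged neighbors (an assumed repair rule)."""
    bad = set(bad_hours)
    hours = sorted(hourly_counts)
    repaired = dict(hourly_counts)
    for hour in bad:
        idx = hours.index(hour)
        # Nearest healthy value before and after the flagged hour, if any.
        before = next(
            (hourly_counts[h] for h in reversed(hours[:idx]) if h not in bad), None
        )
        after = next(
            (hourly_counts[h] for h in hours[idx + 1:] if h not in bad), None
        )
        neighbors = [v for v in (before, after) if v is not None]
        if not neighbors:
            raise ValueError(f"no healthy adjacent hour available for hour {hour}")
        repaired[hour] = mean(neighbors)
    return repaired


def backfill_order(dependencies):
    """dependencies maps each pipeline to the set of pipelines it reads from.
    The result lists every pipeline after all of its upstream pipelines."""
    return list(TopologicalSorter(dependencies).static_order())
```

Called with a hypothetical chain such as `{"client_report": {"daily_agg"}, "daily_agg": {"hourly_audience"}, "hourly_audience": set()}`, `backfill_order` lists the source table first and the client-facing report last, which is the "strict order" constraint: a pipeline is only recomputed once everything it reads from has been corrected.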
Bad data can mean lost money or client dissatisfaction. But as long as we understand the root cause and swiftly put a concrete action plan in place, we have things under control. The story above is one of many examples of the data-related incidents we deal with on a daily basis.
At Criteo, we know how critical data can be to our business. Our predictive models ingest hundreds of terabytes of training data. They learn how to deliver an outstanding outcome for our clients’ advertising campaigns. Our data analysts execute ad-hoc queries or make use of reporting pipelines to answer clients’ needs.
Understanding how vital data infrastructure is, we don’t hesitate to roll up our sleeves and get our hands dirty to keep a decent architecture. We maintain our own data centers, build our in-house data pipelines, and try to keep a viable consistency across the system.
We have dedicated teams to regulate data flowing through the whole system. Data observability and transparency enable us to trace bad data back to its origin and identify contaminated databases to isolate. We create data models suited to each use case, along with data integrity monitoring and tools that track quality inconsistencies.
We value data transparency, and internally we do not hide our data behind security walls. We hope to add a documentation layer to the existing data ecosystem to make it easier to examine a given data set.
We have textbook procedures for handling data-related incidents. A quick-response code that requires stakeholders to act within a limited period lets us promptly determine the origin of the data. The sole bottleneck is knowing “the right people to ask”.
In the near future, we would like to move data quality monitoring closer to the sources and to define data ownership more clearly.
Data governance adds a security layer
We are the SRE Core Data & Governance group, composed of four small teams with a handful of data engineers. Our main products range from reporting data to data documentation. Data quality is among our top concerns in maintaining a healthy data system.
We started the initiative by defining a clear vision for data quality across Criteo R&D:
- A definition of data quality
Data Quality ensures that data is fit for its intended uses in operations, decision making, and planning.
- The current status of data quality efforts in each data producer and data consumer entity.
- A long-term roadmap for how we, SRE Core Data & Governance, will propagate our Data Quality guidelines and best practices to other Criteo R&D squads.
Big Data Quality at Criteo
What is Data Quality and why it matters more than you think on Hadoop.
At the time of this writing, we have almost 1000 consistency checks and more than 1500 stability checks actively running on our Core Data Models. The Core Data Models (or CDM, as we call them internally) consist of approximately 200 tables used for reporting purposes.
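As a rough illustration of what such checks can look like, here is a minimal Python sketch. The definitions are assumptions made for the example, not our production rules: we take a consistency check to mean that two tables agree on the same metric within a tolerance, and a stability check to mean that the latest value stays close to its recent history. The function names and thresholds are hypothetical.

```python
from statistics import mean


def consistency_check(total_a, total_b, rel_tolerance=0.01):
    """Pass when two tables report the same metric within a relative tolerance
    (assumed definition of a consistency check)."""
    reference = max(abs(total_a), abs(total_b))
    if reference == 0:
        return True  # both totals are zero, so they agree
    return abs(total_a - total_b) / reference <= rel_tolerance


def stability_check(history, latest, max_rel_change=0.5):
    """Pass when the latest value deviates from the mean of recent history by
    at most max_rel_change (assumed definition of a stability check).
    history must contain at least one value."""
    baseline = mean(history)
    if baseline == 0:
        return latest == 0
    return abs(latest - baseline) / abs(baseline) <= max_rel_change
```

With these assumed thresholds, `stability_check([100, 110, 90], 400)` fails because the latest value is four times the baseline, which is the kind of jump the April incident produced. Real checks would likely account for seasonality and use per-table thresholds.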
CDM nowadays provides Criteo data users with a stable, robust, and reliable data source for reporting. Our tables go directly to client reports, performance dashboards, or campaign management centers. One of our core challenges is to maintain high data integrity. It is critical in our job to keep those data clean and accurate.
Besides the data quality checks, we also have monitoring tools such as Tableau dashboards to visualize the check results for daily surveillance. We also expose our work alongside our data documentation tool to help end-users navigate the datasets easily.
Maintaining reliable data pipelines is not all about monitoring. As data has become an indispensable asset, big tech companies now treat data as a part of the development cycle. Think software, but for data. Deployment, debugging, observability. There are more and more promising initiatives coming from tech players across the globe. Each one has its method to combat bad data, but they share a common goal of improving user experience.
Monitoring Data Quality at Scale with Statistical Modeling
Good business decisions cannot be made with bad data. At Uber, we use aggregated and anonymized data to guide…
As someone who works closely with data, one of my duties is to question the data we consume and ensure the quality of the data we produce. I’ve been intrigued by how we can better supervise our data quality. I’ve come up with a framework named the triple As, a.k.a. the answers to “What, How, and Then?”
- Alert: “What is wrong with the data?”. The more we monitor our data quality, the sooner we are informed of abnormal circumstances. An alerting system relies on multiple means of communication: email, Slack channels, JIRA tickets, etc. Through these channels, we hope to stay one step ahead of data-related incidents.
- Action: “What do we do in case of an incident?”. Either quickly deploy a fix or manually fill a missing period. Not only the actions count, but also the documentation. Our successors depend on how we systematize our responses to bad data detection.
- Anticipate: we get alerted, we correct the issues, and then what? How do we prevent a similar event from occurring in the future? That’s not a straightforward question. It requires an understanding of the root cause and of how far we can go to fix the erroneous data.
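The three As can be sketched as a small incident record that is only considered closed once all three questions have an answer. This is a hypothetical illustration of the framework, not an existing tool: the `Incident` class, the `raise_alert` function, and the notifier callables (standing in for email, Slack, or JIRA integrations) are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Incident:
    check_name: str
    alert_message: str                                  # Alert: what is wrong
    actions: List[str] = field(default_factory=list)    # Action: what was done
    root_cause: Optional[str] = None                    # Anticipate: why it happened
    prevention: Optional[str] = None                    # Anticipate: how to avoid a repeat

    def is_closed(self) -> bool:
        """Closed only when an action is documented and both the root cause
        and a prevention measure are written down."""
        return (
            bool(self.actions)
            and self.root_cause is not None
            and self.prevention is not None
        )


def raise_alert(
    check_name: str, message: str, notifiers: List[Callable[[str], None]]
) -> Incident:
    """Send the alert through every channel and open an incident record."""
    for notify in notifiers:
        notify(f"[{check_name}] {message}")
    return Incident(check_name=check_name, alert_message=message)
```

Passing `print`, or a list's `append` method, as a notifier is enough to try it out. The point of `is_closed` is that fixing the data is not the end of the incident: the record stays open until the Anticipate step has been documented.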
By failing to prepare, you are preparing to fail — Benjamin Franklin
We have little to no tolerance for poor data quality. We aim to build fundamental principles that create a security layer between our data and the client-facing presentations. Today we are looking for enthusiastic data engineers to strengthen our team and join us in the fight against bad data.
The risk of having bad data will always be present, and the most suitable solution is to be prepared. We understand that no one can avoid bad data, because good and bad data are just two sides of the same coin. No data is bad by nature; there are only inappropriate contexts.
Don’t forget to head over to the latest contributions from the Data Team Criteo Labs to the Medium community.
Technical Data Roadmap: Why and how to build it using a maturity matrix?
A technical roadmap is a plan of short-term and long-term goals to support a technical vision.