Flattening the Data Quality Mistrust Curve with DataOps
A lack of trust will dramatically impact your efforts to become data driven unless you proactively limit the spread of mistrust from data quality incidents.
It only takes a small problem to shake someone’s trust in data, but it takes a lot of deliberate effort to make them realize it was just one problem, not a larger issue. Even mature data organizations run this risk, as it is impossible to fully eliminate all data quality issues. This can lead to the rapid spread of mistrust throughout your organization unless you adapt some lessons from the world’s efforts to flatten the curve of the COVID-19 pandemic.
Data Mistrust is already Endemic
Despite the fact that nearly 98% of organizations are making major investments in their ability to become data driven, data quality still costs the average organization $15M per year in bad decisions, according to Gartner, and impacts 90% of companies, according to Experian. While I covered a few data quality horror stories in a prior article, it isn’t common for a data quality problem to bring down a whole company. Additionally, there are now many modern tools (SodaData, ToroData, Trifacta) and practices (primarily DataOps) that make applying data quality best practices much easier than it once was.
However, I don’t think I’m going out on a limb in stating that data quality will never be a solved problem. I also think that the biggest impact, by far, of data quality issues is not the direct impact of decisions based on poor data, but the loss of trust across an organization in the data and the data platform that serves it up. Especially in organizations that have been hit by bad data quality in the past, often repeatedly, it doesn’t take a major event to set the “here we go again” messages echoing across the organization.
In fact, an IBM study showed that fully one-third of business executives don’t trust their data, and they believe that on average 32% of their data is incorrect. It’s nearly impossible to estimate the cost of a reversion to gut-feeling decision making, but it’s likely quite significant. Another study, this time from Forrester, found that a third of Data Analysts spend over 40% of their time reviewing data field-by-field for quality issues due to a lack of trust. It’s likely that these time costs are not factored into the $15M of hard costs mentioned above.
The R0 of Mistrust
I find it hard to believe that a third of the data in an average organization is bad. A more reasonable number is probably somewhere around 10%*, although this likely varies significantly across data domains. The 22-point gap between the 10% of actually bad data and the 32% of assumed bad data is the result of mistrust spreading through the workforce.
In epidemiology, R0 (pronounced R-naught) is the number of new people likely to be infected by a single infected person. In case you’ve been living under a rock, the following clip from the 2011 movie Contagion does a great job explaining it:
Just like in the explanation from Dr. Mears, many factors go into the R0 of mistrust in data. It is a simple fact that fear, uncertainty, and doubt spread quickly beyond the first person who experiences an event that shakes their trust. In fact, the tagline of Contagion was “Nothing Spreads Like Fear”. We will dig into those factors in the next section before discussing mitigations at the end of the article.
As we’ve all learned over the past few months, a strategy that relies on containment alone is doomed to failure and will likely yield a pandemic of mistrust. Unfortunately, nearly everything that is written about data quality focuses on preventing, detecting, and quickly correcting data quality problems. To be clear, if your organization is not able to quickly address data issues, then you should definitely invest in that area before or along with the techniques introduced in this article. However, a truly successful data organization will have to accept that data quality issues will always arise, despite our best efforts to prevent them, and shift from an attitude of prevention to an attitude of prevention and R0 minimization.
What leads to the R0 of a Data Quality Incident
A few years ago, I was responsible for developing a Failure Modes and Effects Analysis (FMEA for short) for a very complex system that was processing massive-scale IoT data. I found the systematic approach to be effective at improving the reliability of the system. For background, FMEA was first developed in the 1950s by the US military for use in nuclear weapons programs, then adopted heavily by other organizations responsible for complex systems (NASA, aircraft and auto manufacturing, semiconductors, healthcare, etc.). It starts with detailing a system’s components (or process steps) and subcomponents, then lists the ways each subcomponent could fail, before examining the downstream effects of each failure and recommending a control mechanism to prevent or minimize its impact.
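To make the FMEA structure concrete, here is a minimal sketch in Python of a worksheet for a hypothetical data pipeline. The components, failure modes, and scores are invented for illustration; the risk priority number (severity times occurrence times detectability, each scored 1 best to 10 worst) is the standard FMEA prioritization.

```python
# Illustrative FMEA worksheet for a hypothetical data pipeline.
# All component names, failure modes, and scores are invented examples.

fmea_rows = [
    {
        "component": "ingestion",
        "failure_mode": "late-arriving IoT events",
        "effect": "daily metrics undercount recent activity",
        "control": "watermark check plus a reprocessing window",
        "severity": 6, "occurrence": 5, "detectability": 3,
    },
    {
        "component": "transformation",
        "failure_mode": "satisfaction score mapped to the wrong scale",
        "effect": "unhappy customers appear satisfied",
        "control": "range and distribution tests on the output",
        "severity": 9, "occurrence": 2, "detectability": 4,
    },
]

def risk_priority(row):
    """Classic FMEA risk priority number (RPN) for one failure mode."""
    return row["severity"] * row["occurrence"] * row["detectability"]

# Work the highest-risk failure modes first.
ranked = sorted(fmea_rows, key=risk_priority, reverse=True)
for row in ranked:
    print(row["component"], row["failure_mode"], risk_priority(row))
```

Note that the frequent but moderate ingestion failure outranks the rarer, more severe transformation failure; that kind of counterintuitive ordering is exactly what the systematic scoring is for.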
Let’s briefly look at the example that is on top of everyone’s mind these days: the COVID-19 infection. The human immune system is one of the most complex systems we know of, but in the case of the novel coronavirus, it is clear that several failure modes lead to hospitalizations and deaths. The doctors and scientists of the world have been collaborating to document the process by which the disease progresses from exposure to infection, to spread in the lower respiratory tract and blood vessel linings. Within each of these steps there is a sub-process where the immune system’s response fails to stop the next step from occurring, leading to a cascading effect. In some cases, the immune system appears to over-respond, leading to the “cytokine storm” that is blamed for many deaths.
The world’s doctors are also working on controls at each of these stages:
- Social distancing to prevent exposure to sick patients
- Masks to reduce the risk of infection when exposed
- Antiviral drugs like Remdesivir to reduce spread by slowing replication of the virus
- Steroid treatments to reduce the effect of an overactive immune system
Applying FMEA to Data & Analytics
While these techniques were designed for machinery and processes where effects are somewhat deterministic, they provide a useful guide for analyzing the complex system that is an organization transforming itself to become data driven. Start with the primary function of the data & analytics organization (my generic definition, extracted from many clients’ vision statements): enable the business to derive the maximum amount of value from its data assets, typically by aiding data-driven decision making. From there, we can begin to enumerate failures.
- A piece of invalid data is included in a data product that causes an invalid decision to be made. For example, a customer satisfaction metric was calculated incorrectly for a small subsection of customers, making it seem as if unhappy customers were very satisfied. This leads your company to send them additional cross-sell advertising rather than following up, further alienating these customers.
- The consumer of a data product detects that one of the reported data points is invalid (i.e. the same scenario as before, but someone in marketing notices the problem before sending out the ads).
- Multiple data products represent the same conceptual thing differently, and the consumer does not know which one is correct. For example, an operations report defines an active customer as one who used the system in the last week, while a billing report defines it as a customer who is paying for a subscription.
- A consumer of a data product is unaware of the requirements that led to transformations between the source and destination, and assumes it is incorrect. For example, a marketing report properly filters out any customer who has opted out of communications, but a customer success rep for an opted-out customer believes that the report must be wrong because their customer is missing.
There are many more potential failures than we can cover in the span of one article. The key is that, in each of these scenarios, the impact at the initial fault point is not very large. However, left unchecked, each can start a cascade of issues that causes a dramatic impact through poor decision making or the spread of mistrust throughout your organization. Therefore, the goal of maximizing the value derived from data is served by a comprehensive approach to mitigating the spread of mistrust.
Approaches to reduce R0 of a Data Quality Issue
To stay on theme and make these easier to remember, I’ll tie them to the treatment stages of an infectious disease: Diagnose, Trace, Treat, Vaccinate. As with disease, effectiveness increases as you progress through these stages.
Intake and Testing, Diagnosis and Quarantine of a Data Issue
This refers to the capture of the issue by someone with the knowledge and authority to take on the steps below. Ideally, these issues would be detected by monitoring within your data platform before any data consumer is made aware of them (see the observability movement within DataOps for tips on how to achieve this). However, it is impossible to catch every problem, and as the examples above showed, even a few problems that escape detection can lead to an outbreak of mistrust. Therefore, it is critical to give your data consumers a voice, and that they trust that when they use their voice to raise an issue, it will be addressed. This allows them to vent their frustration in a productive way, rather than starting an echo chamber with their peers who are equally powerless to make things better. In effect, it quarantines the impact to the data consumer who experienced the issue.
It is critical to give your data consumers a voice, and that they trust that when they use their voice to raise an issue, it will be addressed
To enable this quarantine approach, make sure every data consumer in the organization has access to a single place to raise data issues. Equally important, they must know how to use this tool. Finally, they must see that their issues are being addressed, which leads to the rest of the steps. Ideally, this will be tied directly to your data catalog to facilitate a quick transfer to the next stage of outbreak prevention.
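As a rough sketch of what such a single intake point might look like, here is a minimal Python model that ties each raised issue to a catalog asset so it can surface alongside that asset. All names here (asset ids, statuses, fields) are hypothetical and not drawn from any specific tool.

```python
# A minimal, illustrative model of a single intake point for data issues.
# Asset ids, reporter names, and status values are hypothetical examples.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DataIssue:
    asset_id: str            # the catalog entry the issue is raised against
    reporter: str
    description: str
    status: str = "open"     # open -> in_progress -> resolved
    raised_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class IssueIntake:
    """One place to raise data issues; lookups are by catalog asset."""

    def __init__(self):
        self._issues = []

    def raise_issue(self, asset_id, reporter, description):
        issue = DataIssue(asset_id, reporter, description)
        self._issues.append(issue)
        return issue

    def open_issues(self, asset_id):
        """Unresolved issues for one asset, for display in the catalog."""
        return [i for i in self._issues
                if i.asset_id == asset_id and i.status != "resolved"]

intake = IssueIntake()
intake.raise_issue("customer_satisfaction_daily", "marketing_analyst",
                   "Satisfaction scores look inverted for a subset of customers")
print(len(intake.open_issues("customer_satisfaction_daily")))
```

The important design choice is the `asset_id` link: because every issue is keyed to a catalog entry, the catalog can show affected consumers that their complaint was captured and is being worked.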
Track, Trace, and Inform the impacted Data Consumers of an Issue
Once an issue has been found, it is extremely important to let those who are going to be affected know. This is often counterintuitive in a culture where admitting failure is feared. But if the goal is to keep data consumers’ default mode of thinking at trust rather than distrust, it is best to tell people when there is something to worry about.
It is important that the description of known data issues is made visible to data consumers in their daily workflow.
Going back to the coronavirus metaphor: because it is unclear when an overactive immune response will cause problems for a given patient, we have all defaulted to worrying that the disease could be fatal to us, and we let fear drive our response. If we knew who was actually at risk, most of us would take that into account when planning our personal pandemic response.
Because of this, it is important that the documentation of data issues is made visible to people in their daily workflow. This is easiest if you provide a consistent location for data quality notes on a given data product. For example, having your data consumers launch their BI reports through your data catalog gives you the opportunity to surface any data quality warnings on the report page before they open the report. However, many people will bookmark their most important reports and launch them directly. This means that you will need to push an alert directly onto the reports (ideally in a consistent location) to get their attention and direct them back to the details in the catalog. Obviously, this technique only instills trust if your data pipelines are mostly reliable, so invest in that area before you spend too much time documenting data issues.
Cataloging what should be trusted is just as important as documenting what should not be trusted for now. One of the most frustrating aspects of the coronavirus is that it can spread from asymptomatic patients, which forces us to treat everyone as if they might have been infected. Because you don’t know, your reaction has to be driven by the worst-case scenario. If, instead, you knew that the people you are interacting with recently had a negative test, you could make an informed decision about what to do.
In the DataOps world, this involves surfacing the data tests that were run to prove out a given pipeline before it was deployed into production, as well as those that run on an ongoing basis to ensure the data meets key expectations. Using BDD-style naming for your data tests and showing test status alongside the data quality notes and warnings gives your data consumers the power to decide what to do based on a deep understanding of the current state of the data.
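Here is a minimal sketch of what BDD-style data tests might look like, in plain Python so the idea stands on its own; in practice you would likely use a data testing framework. The check names, sample data, and thresholds are all illustrative: the point is that each check’s name reads as a plain-language statement of the expectation it enforces.

```python
# BDD-style data tests: each function name states the expectation in
# business language, so its pass/fail status is legible to data consumers.
# The sample rows and thresholds below are invented for illustration.

customer_rows = [
    {"customer_id": "c1", "satisfaction": 4.2, "opted_out": False},
    {"customer_id": "c2", "satisfaction": 1.0, "opted_out": True},
]

def satisfaction_scores_fall_between_one_and_five(rows):
    return all(1.0 <= r["satisfaction"] <= 5.0 for r in rows)

def every_row_has_a_customer_id(rows):
    return all(r.get("customer_id") for r in rows)

def run_suite(rows):
    """Map each human-readable expectation to its pass/fail status,
    ready to display next to the data quality notes in a catalog."""
    checks = [
        satisfaction_scores_fall_between_one_and_five,
        every_row_has_a_customer_id,
    ]
    return {c.__name__.replace("_", " "): c(rows) for c in checks}

print(run_suite(customer_rows))
```

A consumer who sees “satisfaction scores fall between one and five: passing” next to their report does not need to understand the pipeline to decide whether to trust the metric.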
Treat the issue and show that it is fixed
Go beyond simply correcting the issue to increase the trust that it has been corrected
Cataloging and raising the current state of data, both good and bad, helps shift the default mindset of data consumers from distrust to trust. The final piece of documentation needed is evidence that actions are being taken to make things better. For this, a work ticket system that tracks data issues works better than simply capturing each issue as a note in the catalog.
The key is blending transparency with simplicity: give the consumer knowledge of what happened to the data they are viewing without getting too deep into the technical weeds. Once you have convinced data consumers that you are transparent, they are far less likely to spread mistrust by asking all around the organization whether that issue they found last week has been corrected.
This brings us to the next stage in minimizing the spread of mistrust: rapidly correcting issues. The key here is to go beyond simply correcting the issue and increase the trust that it has been corrected. Consider the following trust epidemic scenario: a marketing analyst finds an issue with customer addresses and asks 5 people who to report it to, then fires off an email to the data engineer responsible for the customer address pipeline. The engineer realizes that his latest change caused a formatting issue, fixes the problem, and emails the marketing analyst that it is fixed.
Now, even though the analyst trusts that the addresses are correct, the 4 other people she talked to don’t know about the fix (or even the specifics of the issue) and are likely to trust address data less. If your data platform provides a robust track-and-trace function, and your data team practices test-driven development, then those 4 people would see, the next time they view their report, that the new test added when the engineer fixed the problem is passing. Additionally, they could see that there are no open tickets against the address data, restoring their trust without a direct conversation with the data engineer.
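The scenario above could be surfaced as a simple trust signal next to each report: trustworthy only when its data tests pass and no tickets remain open against its upstream data. This is an illustrative sketch; the function and field names are hypothetical.

```python
# Illustrative "show that it is fixed" signal for display next to a report.
# Test names and ticket ids are hypothetical examples.

def trust_banner(test_results, open_tickets):
    """Summarize data health for one report from its test results
    (a dict of check name -> pass/fail) and its open issue tickets."""
    if open_tickets:
        return f"warning: {len(open_tickets)} open issue(s); see catalog for details"
    failing = [name for name, ok in test_results.items() if not ok]
    if failing:
        return f"warning: failing checks: {', '.join(failing)}"
    return "all checks passing, no open issues"

# After the engineer's fix: the new formatting test passes, ticket closed.
status = trust_banner(
    test_results={"addresses match expected postal format": True},
    open_tickets=[],
)
print(status)
```

The 4 bystanders from the scenario see the same banner the analyst does, so the fix restores everyone’s trust at once rather than one email thread at a time.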
Vaccinate to create Herd Immunity
Unlike with an infectious disease, recovering from a data issue does not, by default, prevent it from occurring again. Therefore, the final step is to stop issues from coming back by curing the systemic problems that led to the data issues in the first place. This is definitely the most complex undertaking, but it also has the highest reward.
You should start by performing trend and root cause analysis on tickets to uncover systemic issues. Next, link issue tickets to “systemic fix” tickets, which will often include the development of new data platform features, but can also include culture, training, or process improvements within your DataOps operating model that catch issues earlier in the development lifecycle.
Examples of these types of improvements are in my other articles:
- Standard data quality rules applied across master data based on schema tests, (see the Data Quality with DataOps article).
- Improved collaboration between analytics teams and app dev teams that control data capture (see the article on Empowering Data Owners for more details)
- The creation of a Data Catalyst role to increase the detailed knowledge of data product development teams in source data (see the Data Transformation with Data Catalysts article)
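The trend and root cause analysis on tickets can start very simply. This sketch uses only the Python standard library; the ticket fields and root-cause labels are invented for illustration.

```python
# Illustrative trend and root cause analysis over issue tickets.
# Ticket ids, root-cause labels, and months are hypothetical examples.
from collections import Counter

tickets = [
    {"id": "T-1", "root_cause": "schema drift in source app", "month": "2020-04"},
    {"id": "T-2", "root_cause": "schema drift in source app", "month": "2020-05"},
    {"id": "T-3", "root_cause": "manual entry error",         "month": "2020-05"},
]

# Trend: issue volume per month, to show improvement (or regression) over time.
by_month = Counter(t["month"] for t in tickets)

# Root cause rollup: the most frequent cause is the first candidate
# for a "systemic fix" ticket.
by_cause = Counter(t["root_cause"] for t in tickets)
systemic_candidate, count = by_cause.most_common(1)[0]

print(systemic_candidate, count)
```

Even at this level of sophistication, the rollup turns a pile of individual complaints into an argument for a specific systemic investment, such as the collaboration with app dev teams mentioned above.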
Providing visibility into these types of improvements is not as simple as the correction of data issues or the capture of data tests, but it isn’t impossible either. You can start by documenting your data processes in a way that can be linked to and from the data catalog and tickets. Using a flexible process diagramming tool that can be embedded directly into the catalog allows the process changes to be documented alongside the systemic data issues that they are meant to address. You can then run micro-learning sessions on the new way of working, record them, run the recordings through AI transcription, and embed the recordings and transcripts in the catalog as well.
You can use the same technique to capture and embed training on the proper ways to use the catalog to find the information above, and put this on your catalog home page. To drive cultural change, you can show the overall trend of data issues raised directly within the data catalog that serves as the “front door to all things data”. As this KPI trends downward, highlight the systemic fixes to your organization’s approach to data that are driving the improvement. From there, you can start a data stewardship program that more directly incentivizes people to improve data documentation and quality.
By now, nearly everyone on the planet has gotten a lesson in epidemiology. Far fewer have recognized that mistrust in the data assets that their organization is investing in spreads in a remarkably similar manner. Using that knowledge, we can apply a data strategy lens to the same techniques that are used in the medical community to determine how to respond.
As always, I’m open to discussing these thoughts with anyone else with a passion for improving organizations’ ability to extract value from their data.
*I could not find a study with rigorous reporting on the percentage of data in organizations that is actually bad. The closest thing was a Gartner study from 2007 stating that 27% of Fortune 1000 companies’ data is “flawed”, but their definition of flawed includes duplicated data. Let’s assume that this trend has continued and that duplicate data accounts for 17 of those 27 percentage points, leaving 10% truly bad data. This means that, on average, the trust gap is more than double the amount of bad data.