Data You Can Trust

Kevin Schiroo · When I Work Data · Nov 12, 2020

The unloved problem of Data Science

Over the last decade there has been rapid innovation in the field of Data Science. Open source projects abound to tackle its numerous technical challenges. Distributed compute solutions, machine learning libraries, visualization tools — you’ve got dozens of options to choose from. However, there remains a core problem in Data Science that languishes in the hinterlands of our collective interests: how do you get data you can trust?

The answer is simple, if not informative.

Demand it.

Easy to say does not mean easy to do, but the fact is that the typical model of a data science team puts very little emphasis on the data itself. Data is treated as an input while dashboards, reports, models, predictions, and insights are the outputs. This is an unproductive conceptualization.

Here on the When I Work data team we have a different perspective. The first and most important output of the data team is the data itself. If we can do only one thing well, that is what we must do. Trustworthy data drives a company forward by enabling everyone to directly engage with data, using it to answer their own questions. It removes us as a bottleneck and allows us to move more quickly. Data is our top priority; everything else is a distant second.

This is peculiar though — if data is an output, then what is the input? This is where language leaves us lacking. Not all data is Data — bytes on a hard drive do not equate to usable information. So when I refer to Data, I mean data that has:

  1. A detailed and enforced specification describing structure, types, and allowed values.
  2. Thorough documentation enabling users not directly familiar with the source material to understand and use the data correctly without consulting additional resources.
  3. An accurate grounding in the facts that it claims to represent.

Our primary responsibility is to ensure this data is trustworthy. It is our first output and the foundation of everything that follows after. This pursuit is as much a human challenge as it is a technical one. It starts with getting a shared understanding of why trust is such a critical problem.

Untrustworthy for Now, Untrustworthy Forever

We ask people to do very big things with data. We ask team leaders to allocate weeks of person time based on data. We ask product owners to redefine their vision of a product based on data. We ask executives to change the direction of an entire company based on data. When people make these decisions they do not do so lightly. At times they are acting against their own intuition in favor of trusting what the data is telling them, and if that trust proves to be misplaced, it is a lesson that will be learned quickly. Once trust is lost, opportunities to rebuild it don’t come easily.

If you’ve worked in, or even just with, a web application you’ll be familiar with production issues. Someone pushed a bad change, servers are crashing, customers are calling, it’s all hands on deck. The bug gets identified, patched, and a fix deployed — things start working again. Large failures tend to have relatively neat edges, a start time and an end time. This is not the case in data.

How do we tell when data has a production issue? Data problems tend to present themselves in subtle ways, results that are just a bit unexpected. When we interact with an application we know what we are expecting to see, and know when it is wrong. When we interact with data we don’t really know what to expect. A data integrity issue and a real-world event often don’t look very different.

This lack of clear observability poses a significant barrier to rebuilding trust once a failure has occurred. When I see an application fail I can very easily determine when it has recovered. When I see a data integrity issue, I don’t have that same ability. I don’t know what the data is supposed to be, so I can’t have that same certainty that the issue has been fixed.

Having even a single moment where the data couldn’t be trusted undermines the ability of real data to be used effectively. It allows people to apply their own speculation against the facts at hand. When our aim is to move companies with data, we need to enable complete confidence in the data. Without it, data becomes no more useful than tea leaves.

Now that we understand the importance of trust in data, how do we go about demanding it?

Push Quality to the Edges

There needs to be a line dividing good from bad, not a gradient. The When I Work data team maintains a data lake and within that lake we don’t tolerate second-rate data. Data is either a full member, or not a member at all. There is no place for data waiting to be improved. There is no holding spot for data dumps in case we might want to pull something out of them later. Tolerating second-rate data gives the illusion of data without having data in fact. It allows data to be collected as an afterthought. Good data demands intent — we must be proactive in planning what data we want. If we haven’t decided what is important about a data set at the time we are collecting it, we don’t have any hope of knowing we got the important stuff right.

Once we’ve committed ourselves to maintaining a high quality data set we create a specification and enforce it at the perimeter. The specification defines how the data will be structured as well as any additional claims that can be made about the data. This is akin to an API, providing a static target at which to aim and a border to the data lake. When a source provides data that fails to meet the specification we make no attempt to recover it; the record is simply rejected.
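In code, perimeter enforcement can be as simple as a validate-then-admit gate. The sketch below is a minimal illustration of the idea; the "signup" spec and its field names are hypothetical examples, not When I Work's actual schema.

```python
# Hypothetical specification: field name -> predicate on allowed values.
SPEC = {
    "user_id": lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
    "plan": lambda v: v in {"free", "basic", "premium"},
}

def violations(record: dict) -> list:
    """Return every way the record fails the spec; empty means it conforms."""
    errors = [f"missing field: {f}" for f in SPEC if f not in record]
    errors += [
        f"bad value for {f}: {record[f]!r}"
        for f, check in SPEC.items()
        if f in record and not check(record[f])
    ]
    errors += [f"unexpected field: {f}" for f in sorted(set(record) - set(SPEC))]
    return errors

def ingest(record: dict, lake: list) -> bool:
    """Admit a record only if it fully meets the spec; never try to repair it."""
    if violations(record):
        return False  # rejected at the perimeter
    lake.append(record)
    return True
```

The key design choice is that `ingest` has exactly two outcomes: full membership or rejection. There is no "quarantine" bucket, which keeps the line between good and bad a line rather than a gradient.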

Require Documentation

Undocumented data isn’t data at all. In order to make data impactful, we must be able to tie it to its real-world meaning. While we can and should take steps to make data self-documenting, we should also demand support through explicit written documentation.

Documentation is a tricky thing. In one of my first jobs out of college a senior engineer described “documentation” as just a synonym for “lies”. While I don’t share the exact same cynicism I do understand the sentiment. It’s easy to allow documentation to drift from reality to the point that it simply becomes a source of confusion.

Our solution to this problem is to couple documentation to specification. The specification of a data set is a functional part of the system; when it isn’t correct, things don’t work. By having the documentation live right alongside the specification, it stays a living part of the system. It is pushed into our awareness by being next to the stuff that is willing to jump out and bite us.
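One way to realize this coupling is to make each field of the spec carry its own description, then generate the human-readable docs from the same object that enforcement uses. The sketch below is an illustrative assumption about how this could look, not our actual tooling; the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Field:
    name: str
    doc: str             # documentation lives beside the enforced rule
    check: Callable      # validation predicate used at the perimeter

# Hypothetical data-set specification with inline documentation.
SIGNUP_SPEC = [
    Field("user_id", "Internal numeric id assigned at account creation.",
          lambda v: isinstance(v, int)),
    Field("signed_up_at", "UTC ISO-8601 timestamp of the signup event.",
          lambda v: isinstance(v, str)),
]

def render_docs(spec) -> str:
    """Generate data-set documentation directly from the enforced spec."""
    return "\n".join(f"{f.name}: {f.doc}" for f in spec)
```

Because `render_docs` reads from the same `Field` objects the validator uses, a field can't be renamed or retyped without the published documentation changing with it.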

Run Frequent Sanity Checks

There will be times when our data loses its grounding in reality. We’ll be getting data that looks correct, but it just doesn’t have the real world meaning we expected. We should plan for this as an inevitability and work towards early detection to limit the scope of errors.

While we never know exactly what accurate data will look like, we usually have a pretty good idea of what it shouldn’t look like. Write sanity checks that regularly validate these assumptions and raise alerts if they aren’t met. If revenue jumped by 100x, chances are dollars have turned to cents; we assume changes will never happen that fast. If there were no new signups in the last day, chances are there is a data flow issue; we assume there should always be some level of signup activity. By regularly checking our assumptions we can ensure that our data stays grounded and that when it doesn’t, problems can be resolved quickly.
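The two assumptions above translate directly into code. This is a sketch, and the `max_ratio` threshold and metric names are illustrative assumptions rather than production values.

```python
def check_revenue_jump(yesterday, today, max_ratio=10.0):
    """Flag implausibly fast revenue changes (e.g. dollars turned to cents)."""
    if yesterday > 0 and today / yesterday > max_ratio:
        return f"revenue jumped {today / yesterday:.0f}x in one day"
    return None

def check_signup_activity(signups_last_day):
    """Flag a likely data flow issue: we assume signups never drop to zero."""
    if signups_last_day == 0:
        return "no new signups in the last day"
    return None

def run_sanity_checks(metrics: dict) -> list:
    """Run every check and collect an alert for each violated assumption."""
    results = [
        check_revenue_jump(metrics["revenue_yesterday"],
                           metrics["revenue_today"]),
        check_signup_activity(metrics["signups_last_day"]),
    ]
    return [alert for alert in results if alert is not None]
```

Each check encodes one named assumption about reality, so when an alert fires it tells you exactly which belief about the data no longer holds.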

Make Data Issues Urgent

Data issues are easy to discount for the simple reason that their consequences are usually distant. They don’t hurt us today, they hurt us next week or next month. Humans aren’t naturally good at prioritizing these sorts of problems. Knowing that this human propensity exists lets us consciously push back against it. When data is viewed as a first-class output, any data issues must be treated as urgent problems.

In practice this means treating data issues in the exact same fashion as you’d treat application issues — someone’s pager needs to go off. If we see data that is failing to meet its obligations we will create an alert for the source of that data. If our sanity checks start failing, someone’s phone will go off, morning, noon, or night. While being on-call isn’t fun, it’s a shared responsibility that keeps everyone focused on maintaining high-quality data.

When we react with urgency to any data integrity issue we send a message about the quality of the data we maintain. We let people know that no errors will be intentionally tolerated and that they can trust the data they depend on. In order for data to be useful it must be trusted.

What you can do with trust

Once you’ve invested in making every data set one that you can trust, you start playing a very different game. Analysis takes dramatically less time and you can feel significantly more confident in the results. It becomes easier to be curious because you know you aren’t stepping into a quagmire. The shadow of an unrecognized data error looms smaller in your mind. But the most meaningful change is in how it enables people operating outside of the field of data science to start engaging with data.

When you have data you can trust, analysis doesn’t need to be one team’s responsibility. Trustworthy data enables product owners, executives, and engineers to engage directly with data and to answer their own questions. There is too much knowledge waiting to be uncovered to expect it to flow through a single team. We aim to build a data-driven company and that begins on a foundation of trust.
