Provenance: the Missing Feature for Rigorous Data Science. Now in Pachyderm 1.1
Pachyderm 1.1 is available today. It includes over 70 bug fixes and new features compared to 1.0. One of those features is provenance, something we consider a core function that’s surprisingly absent from other big data packages. This post covers the motivation, architecture and implementation of provenance in Pachyderm.
Before we can talk about the why and how, we should first agree on what exactly provenance is. Generally speaking, provenance is something’s origin; my provenance is Ithaca, New York, which is where I was born. In the context of data, provenance refers to the other data that was used to compute a given piece of data. For example, if I have a list of sales that my store has made, I could compute the total revenue of my store. That would leave me with two pieces of data: a list of sales and the computed revenue. The sales are the provenance for the revenue.
Provenance captures dependency between data sets.
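To make the definition concrete, here is a minimal Python sketch of the sales and revenue example. It is a toy model rather than Pachyderm's API: the `Dataset` class, the `derive` helper and the sale amounts are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Dataset:
    """A named piece of data plus the datasets it was computed from."""
    name: str
    value: Any
    provenance: Tuple["Dataset", ...] = ()


def derive(name: str, fn: Callable[..., Any], *inputs: Dataset) -> Dataset:
    """Run fn on the inputs' values and record those inputs as provenance."""
    return Dataset(name, fn(*(d.value for d in inputs)), inputs)


# Raw data has no provenance of its own (amounts are made up).
sales = Dataset("sales", [120, 80, 50])

# Derived data remembers where it came from.
revenue = derive("revenue", sum, sales)

print(revenue.value)                         # 250
print([d.name for d in revenue.provenance])  # ['sales']
```

Because `derive` is the only way a result gets built in this sketch, every result carries its inputs along with it.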
So why is this concept useful? First, it enables reproducibility, because results can be traced back to their origins and recomputed from scratch if need be. If I notice that my revenue numbers are lower than expected, my first step is to look at the raw data and see if anything looks off. Maybe it’s a bug in my analysis, in which case I can change the code and quickly rerun it on exactly the same data. From there, I can perform further computations on the raw sales data to learn more and clarify the results. For example, I might want to compute my average revenue per sale and my margin: two new pieces of data, both of which have my original sales list as their provenance.
Provenance gets even more powerful when it’s used collaboratively by a group of people. Suppose a colleague of mine came upon the data from the example above. She’d immediately be able to see the revenue figure, the raw data that had gone into it and the later computations I’d done to clarify it. Thanks to provenance, she’ll have all the context necessary to understand what our organization already knows about this data, so she can get started doing her own analysis on top of mine.
Users upgrading to 1.1 won’t need to do anything special to take advantage of provenance. Pipelines automatically record provenance as they execute, and it’s impossible for an analysis to take inputs without those inputs becoming provenance for the output. This was an early design decision we made; we felt that if our provenance implementation required users to do extra work, it would invariably break down in the real world. Knowing 95% of a result’s provenance is often more confusing than knowing none of it.
Pachyderm is different from most large-scale storage systems in that it doesn’t just store data, it stores history as well. The architecture and nomenclature are similar to Git. Data is organized into repos, with commits (snapshots) in each repo representing the state of that dataset at different points in time. Pachyderm implements provenance for both repos and commits. In the above example, the datasets “revenue”, “revenue per sale” and “margin” would each be repos with the “sales” repo as provenance.
Inside each of these repos are multiple commits, which represent the historical states of the dataset. These commits are also linked by provenance. For each commit in my sales repo there’s a corresponding commit in each of my downstream repos. You can think of these commits as a consistent snapshot of the data and analysis at a specific point in time.
This prevents one of the largest sources of discrepancies in data science: people computing results from different versions of the same data set.
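The repo and commit layout described above can be sketched in a few lines of Python. This is an illustrative model only, not Pachyderm's data structures: the `Commit` class, the `snapshot` helper and the commit IDs are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Commit:
    """A snapshot of one repo, plus the upstream commits it was computed from."""
    repo: str
    id: str
    provenance: Tuple["Commit", ...] = ()


# Two snapshots of the raw sales data (IDs are made up for illustration).
sales_v1 = Commit("sales", "s1")
sales_v2 = Commit("sales", "s2")

# Each downstream repo gets one commit per sales commit.
downstream_repos = ["revenue", "revenue per sale", "margin"]
commits: List[Commit] = [sales_v1, sales_v2]
for sales_commit in (sales_v1, sales_v2):
    for repo in downstream_repos:
        commits.append(Commit(repo, f"{repo}@{sales_commit.id}", (sales_commit,)))


def snapshot(source: Commit) -> List[Commit]:
    """Return source plus every commit that lists it as provenance."""
    return [c for c in commits if c == source or source in c.provenance]


# Everything computed from the second version of the sales data.
for c in snapshot(sales_v2):
    print(c.repo, c.id)
```

Anyone reading results out of `snapshot(sales_v2)` is guaranteed to be looking at numbers computed from the same version of the sales data.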
All of the provenance functionality comes together in a high-level command called “flush-commit”. Flush-commit takes a set of commits and tracks those commits forward to the downstream commits that were created as a result, that is, commits which have the original set of commits as provenance. If we flushed the commit to our sales repo from above, we’d expect to get back a commit from each of the downstream repos. These commits form a consistent snapshot. Flush is smart about how it creates these global snapshots. Since downstream analysis takes computation time, some of the commits might not even exist when the flush is issued. Pachyderm will therefore wait for all the downstream analysis to finish before flush returns, and if anything downstream fails, it will tell you which step in the pipeline had the error.
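The tracking-forward step can be sketched as follows. This is a simplified model under stated assumptions, not how Pachyderm implements flush-commit: `Commit`, `ancestors`, `flush` and `DownstreamFailure` are hypothetical names, the sketch reads "have the original set of commits as provenance" as requiring every flushed commit to appear in a result's provenance chain, and it leaves out the waiting behavior by assuming all downstream commits already exist.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Commit:
    repo: str
    id: str
    provenance: Tuple["Commit", ...] = ()  # direct upstream commits
    failed: bool = False                   # hypothetical status flag


class DownstreamFailure(Exception):
    """Raised when a downstream commit failed; the message names its repo."""


def ancestors(commit: Commit) -> List[Commit]:
    """All commits reachable by following provenance links upstream."""
    found: List[Commit] = []
    for parent in commit.provenance:
        found.append(parent)
        found.extend(ancestors(parent))
    return found


def flush(sources: List[Commit], all_commits: List[Commit]) -> List[Commit]:
    """Return commits whose provenance includes every source commit."""
    result = [
        c for c in all_commits
        if sources and all(s in ancestors(c) for s in sources)
    ]
    for c in result:
        if c.failed:
            raise DownstreamFailure(f"step producing repo {c.repo!r} failed")
    return result


sales = Commit("sales", "s1")
revenue = Commit("revenue", "r1", (sales,))
per_sale = Commit("revenue per sale", "p1", (sales,))
margin = Commit("margin", "m1", (sales,))
all_commits = [sales, revenue, per_sale, margin]

print([c.repo for c in flush([sales], all_commits)])
```

Flushing the sales commit in this model returns one commit from each downstream repo, and a failed downstream commit surfaces as an error naming the repo it belongs to.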
Pachyderm is founded on the principle that data tools can and should do more to enable rigorous data science.
We’ve written at length about this topic and what it means to us in our Data Science Bill of Rights. Provenance is a huge step toward that mission, since it’s both useful by itself and forms the bedrock for later features. We’re hopeful that over time provenance will become a standard feature in data systems, as we feel it’s impossible to do good data science without it.