Debunking the FUD about data version control implementations

Joe Doliner
Pachyderm Community Blog
9 min read · Jul 12, 2021

There’s a narrative popping up lately about an impossible tradeoff in data version control systems: is it better to have a diff-based or a snapshot-based architecture? But the whole narrative comes from seeing the two approaches as separate. The truth is that people haven’t been thinking about the problem correctly.

In this post I’ll talk about the right way to think about the problem. At Pachyderm, we’ve implemented both architectures and moved beyond seeing them as mutually exclusive. Our current system combines the best of both worlds and delivers something unique that you can’t pigeonhole as either.

Snapshot vs diff architectures

The essence of the snapshot versus diff argument is that version control systems can choose one of two architectures:

  1. A diff-based architecture, in which the core primitive, a diff, describes a change to the data relative to the previous state.
  2. A snapshot-based architecture, in which the core primitive, a snapshot, describes the state of all the data at a single point in time.

The significance of this choice is in which operations are most efficient and how much storage bloat is required to keep many versions. A diff-based architecture stores less data because it’s only storing the diff between changes. It’s faster at returning diffs (unsurprisingly, since it stores them directly) but slower at retrieving complete snapshots, since it must construct them by applying diffs. A snapshot-based architecture stores more data because snapshots are redundant. It’s much faster at retrieving a given data state, since it can get all the data at once, but slower at returning diffs, since it must construct them by diffing two snapshots.
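To make the two primitives concrete, here’s a minimal sketch in Go. Everything here is a hypothetical illustration, not Pachyderm’s actual data structures: the diff-based store reconstructs snapshots by replaying diffs, and the snapshot-based store reconstructs diffs by comparing snapshots.

```go
package vc

import "bytes"

// Diff maps a path to its new content; nil content marks a deletion.
type Diff map[string][]byte

// Snapshot maps every path to its full content at one point in time.
type Snapshot map[string][]byte

// applyDiffs replays a chain of diffs onto an empty state to
// reconstruct a snapshot: cheap to store, expensive to read.
func applyDiffs(diffs []Diff) Snapshot {
	snap := Snapshot{}
	for _, d := range diffs {
		for path, content := range d {
			if content == nil {
				delete(snap, path)
			} else {
				snap[path] = content
			}
		}
	}
	return snap
}

// diffSnapshots compares two stored snapshots to reconstruct a diff:
// cheap to read, expensive to store.
func diffSnapshots(prev, curr Snapshot) Diff {
	d := Diff{}
	for path, content := range curr {
		if !bytes.Equal(prev[path], content) {
			d[path] = content
		}
	}
	for path := range prev {
		if _, ok := curr[path]; !ok {
			d[path] = nil // deleted in curr
		}
	}
	return d
}
```

The asymmetry is visible in the code: each architecture stores one primitive directly and pays a reconstruction cost, linear in the data, for the other.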

At a very basic level this tradeoff is true and something users should care about, but once we start to optimize these systems, the distinctions get a lot less clear and the two systems blend together.

Diff-based system adherents will probably take exception to my claim that diff-based systems need to apply diffs to return a snapshot. That’s generally true, but the naive reading implies that diffs must be applied from the beginning of time. Modern diff-based systems cache intermediate results of diff application so that only a small number of diffs need to be applied to get the most recent snapshot. Another way of thinking about this is that they compact many smaller diffs into an uber-diff that contains all of the changes of the smaller diffs and can be applied quickly to a blank state.

Ultimately there’s very little difference between an uber-diff and the snapshot you get from applying that diff. The thing to notice, though, is that we’re gradually sliding toward the other architecture by storing snapshots with redundancy. It’s not totally the same, because the snapshots are cached objects rather than the source of truth: we can prune them, or store them in ephemeral storage, since they can always be reconstructed on the fly. But reconstructing them is exactly the expensive operation we want to avoid, so we do need to store them for some amount of time to get any value out of them. Which snapshots you cache is the fundamental question for optimizing diff-based systems.
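As a rough sketch of that compaction step (reusing the hypothetical `Diff` type from the sketch above), folding many diffs into one is a single pass:

```go
// compact folds many small diffs into one "uber-diff" that can be
// applied to a blank state in a single pass. Caching results like
// this is what makes reads fast in an optimized diff-based system.
func compact(diffs []Diff) Diff {
	uber := Diff{}
	for _, d := range diffs {
		for path, content := range d {
			uber[path] = content // later diffs win
		}
	}
	return uber
}
```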

Next up, snapshot-based systems. Snapshot aficionados probably took exception to my claim that snapshots are redundant. Again, this is generally true, but the naive reading implies that each snapshot stores a completely new copy of the data. Most snapshot-based systems try to deduplicate redundant data where they can, so that only a small amount of redundant data is stored. If a snapshot contains only one new file, then it only needs to store that file; the rest of the files can be stored as references. This looks eerily similar to a diff-based system, but on a per-file basis.

The important optimization is where you decide to deduplicate redundant data. Git does this at the file level. If you have a 1KB source file and you change one character, it’ll store another 1KB of data. This sounds inefficient, but it works for git because source code files just aren’t that big. It’s also why git-based data version control systems tend to struggle with large binary files. You could instead break files into smaller chunks and deduplicate at the chunk level.
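Here’s file-level deduplication in miniature. This is a simplified sketch in the spirit of git’s object store (real git adds trees, object headers, and compression), and all the names are hypothetical:

```go
package vc

import (
	"crypto/sha256"
	"encoding/hex"
)

// blobs deduplicates file contents by content address: identical
// files are stored once no matter how many snapshots reference them.
var blobs = map[string][]byte{}

// TreeSnapshot is a git-style snapshot: paths map to content hashes,
// not to the content itself.
type TreeSnapshot map[string]string

func put(content []byte) string {
	sum := sha256.Sum256(content)
	addr := hex.EncodeToString(sum[:])
	blobs[addr] = content // overwriting with identical bytes is a no-op
	return addr
}

// snapshot records a commit: only files whose hash is new add data to
// the blob store; unchanged files cost one reference each.
func snapshot(files map[string][]byte) TreeSnapshot {
	t := TreeSnapshot{}
	for path, content := range files {
		t[path] = put(content)
	}
	return t
}
```

An unchanged file costs one hash reference per snapshot; only genuinely new content adds bytes to the blob store.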

Similar to the above, the thing to notice is that we’re gradually sliding toward the other architecture by deduplicating our snapshots until they resemble diffs. Again, it’s not exactly the same, because you need to define your chunk size for deduplication, whereas a diff-based system takes those boundaries directly from the user when they commit, but it’s very close. With a small enough chunk size, you’ve essentially arrived at a diff-based system: you heavily deduplicate storage but need to stitch together many chunks to reconstruct a full file.

The ultimate conclusion is that diff vs. snapshot isn’t a real dichotomy; in math terms, the two architectures are isomorphic.

You can think of a diff-based system as a snapshot-based system that deduplicates its snapshots on commit boundaries, and you can think of a snapshot-based system as a diff-based system that caches a snapshot for every commit.

Knowing that a system is diff-based or snapshot-based really doesn’t tell you much about its performance. For that, you’ll need to understand how it optimizes various operations. Understanding diffs and snapshots as primitives is useful; fixating on them as the difference between competing systems is a mistake.

The difference between the two mostly matters to the programmers writing the system, not the user.

Most users assume git is diff-based because `git diff` is fast and git lets you apply diffs as patches, but it’s not. Again, the difference between these systems is fairly superficial, but git definitely does not store diffs: it stores commit trees, which are deduplicated snapshots of the repository at each commit.

So if the diff vs. snapshot dichotomy isn’t useful for understanding a system, what is?

Here’s how we think about it at Pachyderm: we focus on the data.

Large-scale data version control

Users give Pachyderm data to store and we write that data to an underlying storage medium. Later, users can request that data back and we give it to them. We focus on how cheaply and quickly we can do those two things. This way of thinking leads us to the actual fundamental tradeoff in version control systems: you can store less data, which makes accessing it slower and more expensive, or you can store more data, which makes accessing it more efficient.

Notice that this is somewhat the same tradeoff that people perceive in diff vs snapshot systems, but it’s formulated purely in terms of data. This makes it much more useful because it allows us to think quantitatively, rather than qualitatively.

To figure out how to make this tradeoff you need to consider the cost and performance structure of the underlying systems. Pachyderm, like most data version control systems, is designed to run on object storage, so we can use S3’s cost structure to compute this. As of April 2021, S3 charges $0.023 per GB-month of storage and $0.0004 per 1,000 requests. Do the math and you’ll come to the very important, but counterintuitive, conclusion that sometimes it’s cheaper to store more data. Specifically, it’s worth storing an extra GB of data if doing so saves you 57,500 requests a month.
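The arithmetic behind that 57,500 figure is a one-liner. Here it is as a runnable sketch, using the April 2021 prices quoted above:

```go
package main

import "fmt"

func main() {
	// S3 pricing as of April 2021 (standard tier).
	storagePerGBMonth := 0.023      // dollars per GB-month
	costPerRequest := 0.0004 / 1000 // dollars per request

	// Break-even: how many requests per month must an extra GB of
	// storage save before storing it is the cheaper option?
	breakEven := storagePerGBMonth / costPerRequest
	fmt.Printf("break-even: %.0f requests/month\n", breakEven) // 57500
}
```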

How does storing more data save you requests?

Imagine that you have a 1MB file and you add 1MB of data to the end of it. You could store the new 1MB separately, but then when you want the content of the file it takes two requests: one for the original 1MB and another for the added 1MB. Depending on how frequently it’s accessed, storing the whole 2MB file might result in a lower S3 bill.

The other factor is that two requests are slower than one, and this time cost matters when expensive GPUs are waiting for downloads to complete so they can start training on the data, or when the most important cost of all is at stake: the user’s time. You generally don’t know exactly how often you’ll be accessing your data, and it’s hard to put a dollar value on a request taking a few milliseconds longer, so we don’t know exactly where the optimum of this tradeoff lies. What’s important is that we understand it exists and think in these terms.

Ultimately this tradeoff can be reduced to a single value: the optimal chunk size. This is the optimal size of a chunk of data in object storage. Smaller than this and your request costs overwhelm your storage cost savings; bigger and the reverse happens. Again, this is analogous to our diffs vs. snapshots comparison: lots of very small chunks would be the equivalent of storing diffs; you can deduplicate your storage heavily, but reconstructing a set of files requires requesting many chunks. Storing fewer, larger chunks reduces the number of requests but increases the storage overhead.

The optimal chunk size depends on the underlying object store and the request pattern, which is why it’s configurable. Again, you’re unlikely to know this number exactly, but you can ballpark it, and the difference between the optimal value and something in the right ballpark is unlikely to matter. We also run internal tests on various object stores and find that, for most workloads on modern object stores, single-digit megabytes make a good chunk size.

There is one last important optimization to consider. Once we have an optimal or near-optimal chunk size, how do we use that information effectively? Suppose we decide 8MB is the correct value. The naïve approach would be to take the data we want to store, chunk it up into exactly 8MB pieces, and content-address those chunks. That way, if we see two chunks with the same content, we deduplicate them. The challenge with naïve adherence to a fixed chunk size is that it falls prey to the shifting boundaries problem. It works great when data is only added at the end of a file, but it can be very inefficient for write patterns with insertions and deletions. Suppose someone makes a single insertion in the middle of a file. With strict chunk sizes, all the data after it shifts, and every chunk after the insertion may have changed! This blows up the storage costs.
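Here’s a toy demonstration of the shifting boundaries problem, with fixed 8-byte chunks standing in for 8MB ones (names hypothetical):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fixedChunks splits data into chunks of exactly `size` bytes (the
// last chunk may be shorter) and hashes each chunk for deduplication.
func fixedChunks(data []byte, size int) [][32]byte {
	var hashes [][32]byte
	for i := 0; i < len(data); i += size {
		end := i + size
		if end > len(data) {
			end = len(data)
		}
		hashes = append(hashes, sha256.Sum256(data[i:end]))
	}
	return hashes
}

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog")
	before := fixedChunks(data, 8)

	// Insert one byte at the front: every boundary after it shifts.
	edited := append([]byte("X"), data...)
	after := fixedChunks(edited, 8)

	// Count chunks from the new version that still deduplicate
	// against a chunk from the old version.
	same := 0
	for _, h := range after {
		for _, g := range before {
			if h == g {
				same++
			}
		}
	}
	fmt.Printf("%d of %d chunks unchanged\n", same, len(after)) // prints 0
}
```

A single inserted byte shifts every boundary after it, so none of the old chunk hashes match and nothing deduplicates.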

The solution is to do content-based chunking in addition to content-based addressing. This allows us to keep most of the chunks the same when a small amount of data changes regardless of where the change occurred.

Content-based chunking is an interesting topic worthy of its own post, but the short version is that you derive chunk boundaries from the content itself. To do this you compute a rolling hash of the bytes and declare a chunk boundary whenever the hash value has a certain property, for example when the first `n` bits of the value are all 0. A higher value of `n` means you’re less likely to find a chunk boundary, which means fewer, larger chunks. Given a desired chunk size you can select an `n` that probabilistically gives you chunks around the size you want.
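To show the idea, here’s a minimal content-based chunker in Go. The rolling hash, window size, and boundary rule are illustrative choices, not Pachyderm’s actual parameters:

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	window = 16            // bytes in the rolling-hash window
	nBits  = 6             // boundary when the low n bits are zero
	mask   = 1<<nBits - 1  // expected chunk size ~ window + 2^nBits
	prime  = 1099511628211 // multiplier; arithmetic wraps mod 2^64
)

// chunk splits data at content-defined boundaries: a boundary falls
// wherever the rolling hash of the last `window` bytes ends in nBits
// zeros, so boundaries depend only on nearby content, not position.
func chunk(data []byte) [][]byte {
	var pow uint64 = 1 // prime^(window-1), to drop the outgoing byte
	for i := 0; i < window-1; i++ {
		pow *= prime
	}
	var chunks [][]byte
	start := 0
	var h uint64
	for i := 0; i < len(data); i++ {
		if i-start >= window {
			h -= pow * uint64(data[i-window]) // slide the window
		}
		h = h*prime + uint64(data[i])
		if i-start+1 >= window && h&mask == 0 {
			chunks = append(chunks, data[start:i+1])
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	r := rand.New(rand.NewSource(42))
	data := make([]byte, 4096)
	r.Read(data)
	before := chunk(data)

	// Insert one byte in the middle: boundaries resynchronize right
	// after the edit, so almost every chunk still deduplicates.
	edited := append(append(append([]byte{}, data[:2048]...), 'X'), data[2048:]...)
	after := chunk(edited)

	seen := map[string]bool{}
	for _, c := range before {
		seen[string(c)] = true
	}
	same := 0
	for _, c := range after {
		if seen[string(c)] {
			same++
		}
	}
	fmt.Printf("%d of %d chunks unchanged after a 1-byte insert\n", same, len(after))
}
```

Because a boundary depends only on the `window` bytes before it, the chunker resynchronizes shortly after an edit, and everything downstream of the resync point deduplicates again.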

Pachyderm applies this content-based chunking and deduplication to all the data it stores, and to the metadata that indexes that data. When you apply this technique throughout, you get a system where the tradeoff isn’t determined by an architectural choice of diffs vs. snapshots; it’s determined by the data itself and the underlying storage medium. If your data set has lots of small additions, the storage pattern will look a lot like a snapshot-based system; with large additions, it’ll look a lot like a diff-based system. With most real workloads, it will look like a mix of the two. In each case the storage pattern is optimized to get the most performance out of the underlying storage for the users of the system. It’s not dictated by an implementation detail of whether our primary data structure is diffs or snapshots.

Hopefully that gives you a much better sense of how we solve the complex problems of storage for data science at Pachyderm. While many tools out there have focused on the 20% of data science that’s experimenting and putting a model into production, we’ve focused on one of the most challenging problems in computer and data science: storing and versioning data at massive, petabyte scale.

If you want to learn about Pachyderm’s product or dive into our hybrid data versioning algorithms, you can Request a Demo, check out our Pachyderm Storage Architecture Webinar, explore the code on GitHub, or try the product for free on Pachyderm Hub.
