The Guide to Data Versioning

Already familiar with versioning code with git? A look at how the same abstractions can be applied to versioning data.

Paul Singman
Whispering Data
10 min read · Dec 13, 2021


“I have never lied to you, I have always told you some version of the truth.”

“The truth doesn’t have versions, okay?” — Something’s Gotta Give (2003)

Jack Nicholson and Diane Keaton discuss data versioning in Something’s Gotta Give

Table of Contents

  1. Introduction
  2. What is Data Versioning?
  3. Why is Data Versioning Important?
  4. How is Data Versioning Implemented?
  5. Examples Using Data Versioning
  6. Data Versioning Best Practices
  7. Conclusion

Introduction

A version of something is defined as “a particular form in which some details are different from earlier or later forms.” In the digital world, versioning is a luxury we are fortunate to be able to indulge in: we can maintain multiple versions of pretty much anything, from small objects to whole systems.

Many things we interact with are versioned automatically — word documents, codebases, the software that runs our precious phones. We rarely think twice about it.

The reason so many things are versioned is that versioning produces an invaluable record of the incremental changes made and when they occurred. The ability to inspect this log is super helpful for understanding why the value of a certain datapoint is what it is.

What’s more: the ability to navigate between different versions is a superpower that acts as a virtual form of time travel. It’s no surprise that every software bug report starts with the same question: which version are you running?

Versioning in Data

Data is one area where versioning is still in its relative infancy. Why is this?

Well, data can be quite large — arguably larger than anything else in the digital world — and it is non-trivial to maintain multiple versions of something so heavy.

Non-trivial… but not impossible.

As we’ll explore, a versioning system for any size of data can exist with the right data structures and abstractions in place to efficiently map data objects to the versions they are a part of.

In business-critical data environments, versioning is increasingly seen as a vital component rather than a nice-to-have premium feature. Before we dive into why this is, let’s take a step back and define what versioning means in the domain of data.

What is Data Versioning?

Fundamentally, to version data means to create a unique reference to a collection of data. This reference can take the form of a query, an ID, or, quite commonly, a datetime identifier.

This general definition qualifies a range of approaches as being “data versioning.” It includes something like saving an entire copy of the data under a new name or filepath every time you want to create a version of it.

It also includes more advanced versioning solutions that optimize storage usage between versions and expose special operations to manage them.

We’ll discuss how these work in more detail in the How Is Data Versioning Implemented? section below.

Why is Data Versioning Important?

Data versioning is important because it allows for quicker development of data products while reducing errors.

Accidentally deleted a petabyte of production data? Restoring a previous version of a dataset is a lot easier than re-running a backfill job that can take a whole day to complete.

Need to identify records that changed in a table without a reliable last_updated field or change-data-capture (CDC) log? Saving multiple versions of the data as snapshots and querying for differences will do the trick.
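
To make the second example concrete, here is a minimal sketch of a snapshot diff using pandas. The paths and schema are hypothetical:

```python
import pandas as pd

# Two daily snapshots of the same users table (hypothetical paths)
old = pd.read_parquet("s3://warehouse/users/2021-12-12/users.parquet")
new = pd.read_parquet("s3://warehouse/users/2021-12-13/users.parquet")

# With no join keys specified, merge compares entire rows: any row not present
# identically in both snapshots is an insert, a delete, or one side of an update
diff = old.merge(new, how="outer", indicator=True)
changed = diff[diff["_merge"] != "both"]
```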

As these examples show, minimizing the cost of mistakes and exposing how data has changed over time are two ways to increase the development speed of a data team. Data versioning is the catalyst making this possible.

Taming Complexity in Modern Data Environments

Versioning data has always had value. But it has particular significance in modern data environments that are being asked to do more than feed internal reporting. In an increasing number of organizations, data supports a myriad of mission-critical business processes, which brings increased responsibility and complexity.

Nick Schrock, CEO of Elementl, describes in an episode of the MAD Data podcast how he sees data environments evolving as a result of this shift (minor paraphrasing and emphasis):

It is about the embrace of cloud technologies, the embrace of managed services, and as importantly, the embrace of engineering practices to tackle and control that complexity…Integrating software engineering practices throughout the data ecosystem is the way to manage Big Complexity and will dominate the next 10 years of development.

The software engineering practices referred to? Things like unit tests, integration tests, CI/CD deployment, and of course, versioning. We see this theme continuing to play out and data versioning gaining adoption across the data ecosystem in the coming years.

Next, let’s look at a few ways we can implement data versioning, starting with basic forms and working our way up to more advanced solutions.

How Is Data Versioning Implemented?

There are several ways to implement data versioning. We’ll cover three instructive approaches below:

Approach #1: Full Duplication

Have a dataset whose evolution you want to track over time? One option is to save a full copy of it to a new location each time you want a version of it. This works best for smaller datasets on something like a daily versioning cadence.

Versioning via saving a full copy of an example users dataset daily.

While this approach does create versioned data, it does so in the least space-efficient way. In the illustration above, any block that stays green is an example of a data object that hasn’t changed but is now duplicated across each version.

Furthermore, code or queries that interact with this versioned data can be error-prone, since the correct date value has to be manually hardcoded in multiple places.

Although not the most elegant solution, it is an easy way to get started versioning data.
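
A minimal sketch of this approach, assuming a pandas DataFrame and a hypothetical S3 layout:

```python
from datetime import date

import pandas as pd

def snapshot_users(df: pd.DataFrame) -> str:
    """Version by full duplication: write the whole dataset to a new dated path."""
    path = f"s3://warehouse/users/{date.today().isoformat()}/users.parquet"
    df.to_parquet(path)  # writing to S3 paths requires the s3fs library
    return path
```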

Approach #2: “Valid_from/to” Metadata

A more space-efficient and incremental approach to versioning works by adding and maintaining two metadata fields in a tabular dataset, often named valid_from and valid_to. When updating a record in this dataset, we make sure to never overwrite an existing record. Instead, we append new records and update the valid_to field to the current timestamp for any record that would have been overwritten.

Besides being something you can implement in your own ETL scripts, it is also notably the approach SQL Server uses for its Temporal Tables feature and what dbt uses for dbt Snapshots.
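
Here is a minimal sketch of that update logic in pandas, assuming an id key column alongside the two metadata fields described above:

```python
import pandas as pd

def apply_updates(table: pd.DataFrame, updates: pd.DataFrame,
                  now: pd.Timestamp) -> pd.DataFrame:
    """Append-only update: close out changed records, never overwrite them."""
    table = table.copy()
    # Records being replaced: currently valid (null valid_to) and present in updates
    replaced = table["valid_to"].isna() & table["id"].isin(updates["id"])
    table.loc[replaced, "valid_to"] = now                      # close the old version
    updates = updates.assign(valid_from=now, valid_to=pd.NaT)  # open the new version
    return pd.concat([table, updates], ignore_index=True)
```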

Using query filters to get the state of the Orders table on Oct. 17.

This approach works quite well for “time traveling” throughout a single collection of tabular data. However, it provides only one method of interacting with the versions — which is to add filters to queries on the metadata fields as shown above.
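
That filter pattern is easy to wrap in a helper. A sketch, again in pandas with the same assumed schema:

```python
import pandas as pd

def as_of(table: pd.DataFrame, ts: pd.Timestamp) -> pd.DataFrame:
    """Reconstruct the table as it existed at timestamp `ts`."""
    valid = (table["valid_from"] <= ts) & (
        table["valid_to"].isna() | (table["valid_to"] > ts)
    )
    return table[valid]

# e.g. the Orders table as it was on Oct. 17:
# orders_oct17 = as_of(orders, pd.Timestamp("2021-10-17"))
```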

Yes, we can materialize the table as it was at various points in time, but that is also all we are limited to.

Approach #3: First-class Data Version Control

If the first two approaches can be summarized as “Let me add a bit of versioning to the data I already have”, now it’s time to change mindsets. Instead, we should think of versioning as a first-class citizen of our data environment: an inherent property of any data we introduce into the system.

To make this possible — as made clear by the limitations of the above approaches — we need to solve a few challenges.

  1. Minimize the storage footprint of data versioning. This means not creating copies of data objects that remain unchanged between versions.
  2. Expose operations that let us interact directly with the versions. Things like “create a version”, “delete a version”, and “compare two versions” are a few that come to mind.
  3. Work the same over any scale of data and any format, structured or unstructured.

Now, these are not simple problems. And you probably won’t hack your way to a solution in an afternoon or even a weekend.

One of the more popular approaches to solving these extends the git version control model to data. We see this in projects like lakeFS, DVC, and git LFS. Let’s take a closer look at how the open-source lakeFS solves the above challenges.

Versioning in lakeFS

Borrowing useful abstractions from git, lakeFS lets you create versions of data via commits, which in turn belong to branches. In essence, “creating a commit” is synonymous with “creating a version” (Challenge #2).

The above diagram shows the full set of relationships, from the actual datafiles being versioned in lakeFS (bottom row) all the way up to the commit and branch abstractions exposed to the user (top).

The important concept is that duplication of datafiles is minimized between commits (Challenge #1). This is depicted by the arrows going from one Metarange to multiple Ranges and the arrows from one Range to multiple objects.

An even more detailed look at these relationships can be seen below:

Detailed looks at the contents of a Range in lakeFS.

What’s the point of all this? When it comes to time travel, we now have a super-easy way to navigate amongst the different data versions (by using the unique generated lakeFS commit_id).
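
Here is a minimal sketch of that workflow. It assumes the high-level lakeFS Python SDK (pip install lakefs); the repository name is hypothetical and the exact method names may differ between SDK versions:

```python
import lakefs  # assumption: the high-level lakeFS Python SDK

repo = lakefs.repository("example-repo")  # hypothetical repository

# "Creating a commit" is "creating a version": branch, write, commit
branch = repo.branch("daily-etl").create(source_reference="main", exist_ok=True)
branch.object("users/users.parquet").upload(data=b"...")  # placeholder bytes
ref = branch.commit(message="Daily users load", metadata={"job": "users_etl"})

# Time travel: read the same path as it existed at that commit_id
commit_id = ref.get_commit().id
snapshot = repo.ref(commit_id).object("users/users.parquet").reader().read()
```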

Additional commands for interacting with the versions can be found in the lakeFS documentation.

Examples Using Data Versioning

Let’s give a clearer picture of how data versioning is useful in different contexts by walking through some examples.

Data Versioning in Machine Learning

Say you work for a company that uses machine learning algorithms to enhance grainy video footage and identify objects. Customers use the enhanced footage for a variety of commercial purposes.

Image source: https://dzone.com/articles/the-most-insightful-computer-vision-project

The data scientists develop a new and improved algorithm that increases the classification accuracy of the outputs. However, it is not possible to roll out the new algorithm to all users at once. Instead, for a period of time, we need to let users switch between the classifications of both algorithms.

We could save the outputs of both versions of the algorithm to different paths in the object store. For one algorithm with just two versions, you could get away with this “hardcoded filepaths” approach. As the number of algorithms and versions increases, though, developers will start to make more mistakes: forgetting to update the path of a version, or losing track of which parameters were used for a particular version.

Incorporating data versioning gives us useful abstractions (e.g. commit messages, branch names) to manage the outputs in a saner way.
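
As an illustrative sketch, using the same assumed SDK as above (the branch names and object layout are hypothetical):

```python
# Hypothetical: one branch per algorithm version, instead of hardcoded output paths
ALGO_BRANCHES = {
    "v1": "enhance-algo-v1",
    "v2": "enhance-algo-v2",
}

def load_classifications(repo, algo_version: str, video_id: str) -> bytes:
    """Read a video's classifications from the branch holding that algorithm's outputs."""
    branch = ALGO_BRANCHES[algo_version]
    return repo.ref(branch).object(f"classifications/{video_id}.json").reader().read()
```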

Data Versioning in Analytics

The foundational process of analytics is creating metrics that contain the logic used to evaluate the business and user behavior. The logic for important concepts like “Active Users”, “Sessions”, and “Churn Rate” is often defined in SQL and calculated on a regular cadence.

A common problem in analytics is that a query that ran fine a day ago might start causing errors because of a change in the data. The most effective way to figure out what is causing the issue is to run the same query over the data as it was when the error first occurred.

Having a version of the data from that point in time available simplifies the debugging process and results in data errors being resolved faster.
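
For instance, with the dated snapshots of Approach #1 as the versioning mechanism, re-running yesterday’s failing metric over yesterday’s data is just a path change (hypothetical layout and schema):

```python
import pandas as pd

# Re-run the failing query over the data as it was when the error first occurred
orders_then = pd.read_parquet("s3://warehouse/orders/2021-12-12/orders.parquet")
active_users = orders_then.loc[orders_then["status"] == "complete", "user_id"].nunique()
```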

Data Versioning Best Practices

When implementing data versioning, here are some tips we’ve found helpful.

Use TTLs to expire older versions

It may be the case that there’s no need to retain versions older than 30 days or a year, for example. But no one wants to play housekeeper and be responsible for deleting older versions of data. And so they start to accumulate.

If there’s a time duration you know versions are no longer relevant, a TTL (time to live) policy is a great way to have older versions of data be deleted automatically. If using a versioning system that doesn’t support TTLs, periodically running a version clean-up script can achieve the same effect.
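
Such a clean-up script can be small. A sketch assuming the dated-path layout from Approach #1 and the s3fs library:

```python
from datetime import date, timedelta

import s3fs  # assumption: snapshots live in S3 under dated prefixes

RETENTION_DAYS = 30
fs = s3fs.S3FileSystem()

for path in fs.ls("warehouse/users/"):
    # Prefixes look like warehouse/users/2021-12-13; parse the date, expire old ones
    snapshot_date = date.fromisoformat(path.rstrip("/").split("/")[-1])
    if snapshot_date < date.today() - timedelta(days=RETENTION_DAYS):
        fs.rm(path, recursive=True)
```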

Version semantically, not temporally

There’s nothing wrong with using a consistent daily or hourly cadence to create new versions of a dataset. In many cases, finding the version with a created_at time closest to the moment you care about is good enough.

What makes data versions even more meaningful is tying version creation to the start and/or completion of data pipeline tasks. An ETL script finished running? Create a version. About to send an email to your “highly engaged” users? Create a version first.

This lets you include more meaningful metadata around your versions beyond the time they were created and lets you figure out what happened much faster if something goes wrong.
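
A sketch of what this looks like, using the same assumed SDK as above (the pipeline function is hypothetical):

```python
def run_users_etl(branch):
    load_users_table(branch)  # hypothetical ETL step that writes to the branch
    # Commit at the pipeline boundary, with metadata describing what just happened
    branch.commit(
        message="users ETL finished",
        metadata={"pipeline": "users_etl", "trigger": "airflow_daily"},
    )
```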

Leverage Versioning to Work Collaboratively

One of the challenges in shared data environments is not stepping on the toes of your teammates. Often data assets are treated as a sort of shared folder that anyone can read from, write to, or modify.

One way to avoid these problems is to create personal versions of the data to develop against. This eliminates the chance that a change you make inadvertently affects someone else.
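
With a branch-based system, a personal version is a one-liner (same assumed SDK as above, hypothetical branch name):

```python
# Develop against your own branch; experiments never touch the shared data
dev_branch = repo.branch("paul-dev").create(source_reference="main", exist_ok=True)
```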

Conclusion

There’s a noticeable movement in the data space to adopt the mindset of treating “Data as a Product”. We believe this is a positive trend for data orgs, and one that requires a “leveling up” of the way data teams operate.

Development best practices like CI/CD, testing, and version control are features you need to be thinking about if you want your data team to confidently take on these types of projects.

This article was originally published by Paul Singman on the lakeFS blog.
