Boost Your Data Engineering Toolkit In 2023 With Data Version Control
Can you imagine what building new software was like before distributed version control systems like Git became the industry standard? 😱 It’s a miracle that teams managed to ship things and roll out updates.
Now get this: that’s exactly where the data world is right now. 🤯
As companies started generating more data than ever, engineers moved away from relational databases and embraced object storage. This shift brought new manageability challenges every data engineer knows firsthand.
💡But here’s some good news: more and more teams are busy developing data versioning solutions to increase velocity and slash the cost of errors.
I rounded up the data version control tools making their mark on the industry right now to help you pick the best one for your project.
TL;DR
Overview of the data version control tools Dolt, Git LFS, DVC, and lakeFS + their key advantages and drawbacks.
Table of contents:
What is data version control?
The central pain point data versioning solves
4 data version control systems you should know: Dolt, Git LFS, DVC, lakeFS
Wrap up
What is data version control?
Here’s a no-brainer for anyone who has ever developed software in a team:
When you’re working on a project, it is important to keep track of all the changes your teammates make.
Scratch that, you actually need to track all the changes you make!
Git lets you commit changes, create different branches from a source, and merge those branches back into the original.
Source: lakeFS
Data version control uses the same idea and applies it to datasets, not source code.
Live data systems ingest new data all the time. At the same time, various users keep experimenting with the same datasets.
This can easily become a nightmare filled with multiple versions of the same dataset. You can’t exactly call this a single source of truth, right?
In machine learning environments, you might end up with a tangled web of datasets and experiments if the models are not properly audited and versioned.
This is where data versioning comes in to save the day!
By implementing data version control, you can track datasets and register changes on a particular dataset.
That way, you can easily check how the project developed over time — seeing everything that was added, modified, and removed.
Plus, you can always switch back to an older version if you face an unexpected issue in the current one, or if you accidentally delete data. You’ll quickly pinpoint the culprit because you’ll have a detailed record of the differences between versions in front of you.
The central pain point data versioning solves
If you’re an administrator or user of databases, data warehouses, and data lakes, you may run into a common problem: the data you have represents only the current state of the world.
Since the world is always changing, this data changes as well. If you want to go back and look at an older state of the data, you can dive into a log file and restore it, but this method isn’t exactly handy for data analytics purposes.
This is the problem that data versioning solves. Advanced data versioning helps users set up a secure system for storing data. For example, in machine learning, data scientists constantly test their models and change their datasets to improve results. With this type of versioning, they can capture the different versions of their data and models in commits, switching between these datasets as needed.
You can keep your machine learning models and data consistent by creating a single history for them. This helps your team use logical file names and allows you to store your data in different storage solutions. Data versioning also paves the way to meeting data compliance requirements by letting you audit changes to your data.
4 data version control systems you should know
1. Dolt
What is it?
Dolt is an open-source project that pairs a SQL database with version control, built on top of the Noms storage engine. This is how Dolt enables Git-like operations on data.
How does it work?
Dolt stores data in a structure called a Prolly tree: a block-oriented search tree that combines the properties of a B-tree and a Merkle tree. Like a B-tree, it keeps data sorted and balanced, the same structure relational databases use to hold indices, which gives you good performance for reading from and writing to the database. Like a Merkle tree, it is content-addressed, which makes it cheap to diff and merge different versions of a table.
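To make the Git-like workflow concrete, here’s a minimal sketch of branching and committing against a local Dolt SQL server from Python. The database and table names (inventory, products) are placeholders; DOLT_CHECKOUT, DOLT_ADD, and DOLT_COMMIT are Dolt’s built-in stored procedures for version control operations.

```python
# Minimal sketch: Git-like operations on a Dolt database, assuming a
# Dolt SQL server on localhost:3306 with a hypothetical "inventory"
# database containing a "products" table.
import pymysql  # pip install pymysql

conn = pymysql.connect(
    host="localhost", port=3306, user="root",
    database="inventory", autocommit=True,
)
cur = conn.cursor()

# Work in isolation on a new branch, like `git checkout -b`
cur.execute("CALL DOLT_CHECKOUT('-b', 'price-update')")
cur.execute("UPDATE products SET price = price * 1.1 WHERE category = 'books'")

# Stage and commit the change, like `git add` + `git commit`
cur.execute("CALL DOLT_ADD('-A')")
cur.execute("CALL DOLT_COMMIT('-m', 'Raise book prices by 10%')")
cur.fetchall()  # DOLT_COMMIT returns the new commit hash
conn.close()
```

Because Dolt speaks the MySQL wire protocol, any MySQL client library works here; pymysql is just one convenient option.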
When to use it?
Pick Dolt if you’re using a relational database and want to continue doing so while enjoying version control capabilities.
Any limitations?
Dolt won’t work for you if your data isn’t in a relational database, or if you’d simply prefer to keep it somewhere else. Managing data at petabyte scale would be tricky. And if you care about speed, Dolt’s structure is less efficient than a conventional relational database. Another bad fit for this solution is a scenario where you rely heavily on unstructured data. In that case, it’s time to look for another solution.
Source: lakeFS
2. Git LFS
What is it?
Git itself doesn’t scale to large data files, but you can use an extension called Git LFS (Large File Storage) to manage data and code at the same time.
How does it work?
Git LFS relies on metadata management: it replaces large files in your repository with small pointer files and stores the actual content outside the regular Git object store. The idea behind Git LFS comes from game development. Game developers deal with code but also with tons of artifacts, most of them binaries affecting how the game looks. They used to manage these assets together with code, which made their repositories heavy and complex, so they developed an add-on to Git that avoids this.
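As a minimal sketch of that workflow, here’s how you might set up LFS tracking from Python by shelling out to Git. The *.parquet pattern and file path are illustrative, and both git and git-lfs need to be installed:

```python
# Minimal sketch: track large data files with Git LFS by shelling out
# to git. Requires git and git-lfs installed; paths are illustrative.
import subprocess

def run(*args: str) -> None:
    subprocess.run(args, check=True)

run("git", "lfs", "install")             # set up LFS hooks for the repo
run("git", "lfs", "track", "*.parquet")  # matching files become pointers
run("git", "add", ".gitattributes", "data/train.parquet")
run("git", "commit", "-m", "Track training data with Git LFS")
# Git now stores a small pointer file; the real bytes live in LFS storage.
```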
When to use it?
This use case later expanded to engineers doing machine learning and research: they deal with files that aren’t code and are heavier than what Git typically handles, such as models and datasets. It’s best to keep these together with the code because of the tight connection between the model and the data it runs on.
Any limitations?
Git LFS integrates smoothly with Git repositories. But using it means that both your code and your data files have to live there. You’ll have to lift and shift your data to coexist with your code, which is time-consuming.
Source: lakeFS
3. DVC
What is it?
DVC is an open-source version control system for data science and machine learning projects. The solution was designed to work alongside version control systems like Git.
How does it work?
When you add data to a project using DVC commands, the solution uploads the data to a remote storage service and creates a metadata file pointing to that location. This metadata file is then added to a Git repository to enable version control. When you add, remove, or modify data files, the metadata file is updated and the new data uploaded. Thanks to these metadata files, you can keep track of data and easily share it with collaborators without actually storing it all in the repository.
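For instance, here’s a minimal sketch of reading a specific version of a DVC-tracked file through DVC’s Python API. The repo URL, file path, and tag below are placeholders:

```python
# Minimal sketch: read a pinned version of a DVC-tracked dataset.
# DVC resolves the .dvc metadata file at the given Git revision and
# streams the data from the configured remote storage.
import dvc.api  # pip install dvc

with dvc.api.open(
    "data/train.csv",                           # path inside the repo (placeholder)
    repo="https://github.com/example/project",  # placeholder repo URL
    rev="v1.0",                                 # any Git commit, branch, or tag
) as f:
    print(f.readline())
```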
When to use it?
DVC is a suitable solution if you’re looking to track and save data and ML models the same way you save code, switch between different versions smoothly, and understand how datasets and ML artifacts were developed.
Any limitations?
DVC lacks some important features of a relational database. If you use one, it’s probably not the best choice for you. Also, caching becomes unrealistic when you’re operating at petabyte scale with hundreds of millions of objects.
Source: lakeFS
4. lakeFS
What is it?
lakeFS is an open-source version control solution that sits on top of the data lake and is based on Git-like semantics. It lets you create isolated data versions, share them with your team, and quickly merge changes back into your main branch.
How does it work?
lakeFS supports data in AWS S3, Azure Blob Storage, Google Cloud Storage, and any other object storage with an S3 interface. It integrates well with popular data frameworks such as Spark, Hive Metastore, dbt, Trino, Presto, and others. lakeFS brings together all the data sources in your pipelines, from analytics databases to key-value stores, behind a unified API that lets you manage the underlying data in all the data stores.
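Because lakeFS speaks the S3 protocol, existing tools can talk to it with almost no changes. Here’s a minimal sketch using boto3; the endpoint, credentials, repository, and branch names are all placeholders:

```python
# Minimal sketch: write to an isolated lakeFS branch through its
# S3-compatible gateway. Endpoint, keys, repo, and branch are placeholders.
import boto3  # pip install boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # your lakeFS server
    aws_access_key_id="AKIAEXAMPLE",            # lakeFS access key
    aws_secret_access_key="example-secret",     # lakeFS secret key
)

# In lakeFS the "bucket" is the repository and the first path segment
# is the branch, so this write lands on `experiment` without touching main.
s3.put_object(
    Bucket="my-repo",
    Key="experiment/datasets/events.parquet",
    Body=b"...",  # your file contents
)
```

Once the experiment checks out, the branch can be merged back into main, much like a Git feature branch.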
When to use it?
If you’re looking for an ETL testing solution, lakeFS is a brilliant match. Other use cases include rolling back to a previous data version in case of an error, keeping track of the exact state of data over time, and implementing CI/CD pipelines via hooks that enable automation of checks and validations of data on lakeFS branches.
Any limitations?
Source: lakeFS
Wrap up
Using the right version control tools, you can transform a chaotic environment into one that you can manage. You gain full control of your data — you know where it comes from, what has changed, and why.
You become that person in the room who can stop the manager before they make a decision based on inaccurate data.
💡Curious to learn more?
Here’s a detailed comparison of DVC, Git LFS, and lakeFS.
Check out lakeFS docs for more guides and detailed information on how to get started.
And don’t forget to join the awesome lakeFS data practitioners’ Slack community to meet like-minded people, ask questions, and share your wicked smart insights!