Member-only story
PYTHON | DATA | PROGRAMMING
Introduction to Data Version Control
A step-by-step guide to implementing your own DVC in Python using Hangar
What is Data Version Control (DVC)?
Any production-level system requires some kind of versioning.
A single source of current truth.
Any resources that are continuously updated, especially simultaneously by multiple users, require some kind of an audit trail to keep track of all changes.
In software engineering, the solution to this is Git.
If you have written code in your life, then you are probably familiar with the beauty that is Git.
Git allows us to commit changes, create different branches from a source, and merge back our branches, to the original to name a few.
DVC is purely the same paradigm but for datasets. See, live data systems are continuously ingesting newer data points while different users carry out different experiments on the same datasets.
This leads to multiple versions of the same dataset, which is definitely not a single source of truth.
Additionally, in a machine learning environment, we would also have several versions…