TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial…

PYTHON | DATA | PROGRAMMING

Introduction to Data Version Control

A step-by-step guide to implementing your own DVC in Python using Hangar

David Farrugia
Published in TDS Archive
7 min read · Aug 18, 2023

Photo by Florian Olivo on Unsplash

What is Data Version Control (DVC)?

Any production-level system requires some kind of versioning.

A single source of current truth.

Any resource that is continuously updated, especially by multiple users at the same time, requires some kind of audit trail to keep track of all changes.

In software engineering, the solution to this is Git.

If you have ever written code, you are probably familiar with the beauty that is Git.

Git allows us to commit changes, create different branches from a source, and merge those branches back into the original, to name a few.

DVC is the same paradigm, but for datasets. Live data systems are continuously ingesting new data points while different users carry out different experiments on the same datasets.

This leads to multiple versions of the same dataset, which is definitely not a single source of truth.
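To make the Git analogy concrete, here is a minimal sketch of what committing and branching a dataset could look like. It is a toy in-memory store written purely for illustration, not Hangar's API or that of any real DVC tool; the names `DatasetRepo`, `commit`, `branch`, `checkout`, `data`, and `log` are all hypothetical.

```python
import hashlib
import json


class DatasetRepo:
    """Toy in-memory version store for datasets (hypothetical, illustration only)."""

    def __init__(self):
        self.commits = {}               # commit id -> {"parent", "message", "data"}
        self.branches = {"main": None}  # branch name -> latest commit id
        self.current = "main"

    def commit(self, rows, message):
        """Snapshot `rows` on the current branch and return the commit id."""
        parent = self.branches[self.current]
        payload = json.dumps({"parent": parent, "data": rows}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_id] = {
            "parent": parent,
            "message": message,
            "data": list(rows),
        }
        self.branches[self.current] = commit_id
        return commit_id

    def branch(self, name):
        """Create a branch at the current head and switch to it."""
        self.branches[name] = self.branches[self.current]
        self.current = name

    def checkout(self, name):
        """Switch to an existing branch."""
        self.current = name

    def data(self):
        """Return the dataset as of the current branch's latest commit."""
        head = self.branches[self.current]
        return [] if head is None else list(self.commits[head]["data"])

    def log(self):
        """Commit messages on the current branch, newest first."""
        messages, commit_id = [], self.branches[self.current]
        while commit_id is not None:
            messages.append(self.commits[commit_id]["message"])
            commit_id = self.commits[commit_id]["parent"]
        return messages


repo = DatasetRepo()
repo.commit([{"id": 1, "value": 0.5}], "initial ingest")

# An experiment adds a data point on its own branch.
repo.branch("experiment")
repo.commit(repo.data() + [{"id": 2, "value": 0.9}], "add new data point")
experiment_rows = repo.data()
experiment_log = repo.log()

# The main branch is untouched by the experiment.
repo.checkout("main")
main_rows = repo.data()
```

Each branch keeps its own history, so an experiment can change the data without disturbing the version everyone else relies on. A real tool would also handle merging, storage on disk, and large arrays, which this sketch leaves out.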

Additionally, in a machine learning environment, we would also have several versions…

Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Written by David Farrugia

Data Scientist | AI Enthusiast and Researcher | Talks about Python, AI, and Data. Get in touch — davidfarrugia53@gmail.com
