Version Control for Data Science — Tracking Machine Learning models and datasets

Version Control for Machine Learning Projects using DVC

VJ
7 min read · Sep 8, 2019

I am a Git god, why do I need another version control system for Machine Learning Projects?

Undoubtedly, Git is the holy grail of version control systems! Git is great at versioning source code. But unlike typical software engineering projects, Data Science projects carry additional hefty files such as datasets, trained model files, and label encodings, which can easily reach a few GBs in size and are therefore impractical to track with Git alone.

So what's the solution?

The amazing bunch of people at https://dvc.org/ have created a tool called DVC. DVC helps us version large data files, much like we version source code files using Git. Better still, DVC works on top of Git, so the two fit into a single workflow!
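To make the idea concrete, here is a toy Python sketch of the general pattern behind this kind of tool: the large file goes into a content-addressed cache, and only a tiny pointer file holding its checksum is left for Git to track. This is not DVC's actual implementation. The function names (`track`, `restore`), the `.pointer.json` suffix, and the cache layout are all made up for illustration. DVC itself records an MD5 checksum in a small `.dvc` file, which is why MD5 is used here.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache and write a small pointer file next to it.

    Hypothetical helper: the pointer file is what you would commit to Git,
    while the data file itself stays out of the repository.
    """
    checksum = file_md5(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / checksum  # the file name is the content hash
    if not cached.exists():
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    meta = {
        "path": data_file.name,
        "md5": checksum,
        "size": data_file.stat().st_size,
    }
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a pointer file refers to (hypothetical helper)."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["md5"], target)
    return target
```

Because the pointer file is only a few lines of text, checking out an older Git commit brings back the older checksum, and the matching version of the data can then be pulled from the cache.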

Most of the time, tracking of datasets and models is ignored in Data Science workflows. Now with DVC we can track all…

