Data Version Control (DVC) with Google Cloud Storage and Python for ML

Data version control is an emerging practice that lets you iterate on machine learning models faster while still tracking changes to data and models.

Avinash Kanumuru
Geek Culture

Introduction

A machine learning project lifecycle differs from a conventional software lifecycle. Conventional software depends little on data, whereas each machine learning model depends on its underlying data and behaves differently when that data changes.

In simple terms: data changes > ML code needs recalibration > model changes

So it is essential to track not just the code but also the data used to build each model.

Welcome to Data Version Control

In this article, we will look at a tool called dvc, which works much like git but for data. For readers not familiar with git: git is an open-source version control system that tracks changes to a set of files and keeps a history of those changes. It is widely used across the software development lifecycle and is one of the building blocks of DevOps. (We will look at the related concept of MLOps in a separate article.)
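
To make "version control for data" a little more concrete, here is a minimal sketch in plain Python of the underlying idea: identify each version of a dataset by a fingerprint of its content, so that any change to the data is detectable. This is not dvc's API or its actual implementation. The `fingerprint` helper, the sample CSV bytes, the `history` list, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a hex digest that identifies this exact content (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical training data, before and after a single value is corrected.
v1 = b"age,income\n34,52000\n41,61000\n"
v2 = b"age,income\n34,52000\n41,63000\n"

# A tiny stand-in for a version history: one (label, fingerprint) pair per version.
history = []
for label, data in [("v1", v1), ("v2", v2)]:
    history.append((label, fingerprint(data)))

# Identical content always yields the same fingerprint; any edit changes it.
data_changed = history[0][1] != history[1][1]

for label, digest in history:
    print(label, digest[:12])
print("data changed:", data_changed)
```

Because the fingerprint is small, it can be stored next to the code in git while the data itself lives elsewhere. A model can then be traced back to the exact data it was trained on.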

Avinash Kanumuru
Leading Data Science & Engineering in the FinTech space