Perfect Way of Versioning Models & Training Data

Ahmedabdullah · Published in Tensor Labs · 8 min read · Mar 30, 2022

Hey there, fellow ML engineers! If you’ve been following my previous posts, I’ll assume your ML journey is going well so far and that you already have your hands on CML, or CI/CD for machine and deep learning.
Most probably, like me, you found it fun and extremely useful, and you are now ready for the next phase of your journey.

Most of us in the AI industry face problems like versioning training data. For example, say we add new data to an existing dataset, retrain, and then discover that the new data is hurting performance because it is full of noise, so we want to revert the dataset to its original state. There is no explicit way to do this. The common workaround is keeping separate folders or drive links, which we all know is a naive and error-prone approach.

A related problem arises when we have to share data with teammates working on the same project. And while we are talking about versioning data, there is also model versioning. Believe me, I have seen people saving model weights as first_model.h5, second_model.h5, final_model.h5, and so on 😅

Sharing itself is another recurring problem: if multiple people are working on the same task, we might create an S3 bucket where everyone can access the data and upload files manually, and the same goes for models. Most of the time these buckets are kept public for ease of sharing between team members.

All of these problems can be solved with Data Version Control, or DVC, which lets you version your models and data and pairs naturally with the CML setup we discussed earlier. In this article, we’ll go through in detail how to do that properly. Now that we know the problems we are facing, let’s look at what DVC is and how we can use it to version our models and data.

Today’s article covers the following main points:
1) What is DVC?
2) How can we add DVC to our own repository? (Hands-on tutorial)
3) What are the scenarios where we need this?
4) Which cloud / local platforms can we use with DVC as remote?

What is DVC?

DVC is built to make ML models shareable and reproducible. It is designed to handle large files, data sets, machine learning models, and metrics as well as code.

DVC is an open-source tool for data science and machine learning projects. One of the features that flattens its learning curve is that it works with “git-like” commands we are already familiar with. On top of that, we can version our data and/or our machine/deep learning models with a few simple commands that are quite easy to learn. The overall concept of DVC looks something like this:

DVC works hand in hand with git. The main goal is to keep our data and code versioning separate while still maintaining consistency between them. In a codebase where git handles all of our code and its history, DVC handles our data, tracked large files, and model files. DVC assists git by managing the data at a location separate from our code while still making it available to anyone who uses our code/repo.

How can we add DVC to our own repository? (Hands-on tutorial)

Let’s get started with DVC. For that we’ll create a new repository and call it Data-Version-Control. A fairly simple task, and we all know how to do that, right?

Once we have created the repository, let’s actually write some code. For demo purposes I am creating a file in my repo called “generate_data.py”, which generates a CSV and fills it with data. I am doing this to show how you can track generated data; you can track your models or any other large files in exactly the same way. generate_data.py looks something like this:
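A minimal sketch of what such a script might look like (the article does not show the original file’s exact contents, so the column names and row count here are assumptions; tune ROWS to control the file size):

```python
# Hypothetical sketch of generate_data.py: writes a CSV of random rows.
import csv
import random

ROWS = 100_000  # increase this to produce a larger file

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "feature_1", "feature_2", "label"])
    for i in range(ROWS):
        writer.writerow([i, random.random(), random.random(), random.randint(0, 1)])
```

Any script that produces a large artifact on disk works here; DVC only cares about the resulting file.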

In order to get started with DVC, we go to the cloned repo and type

dvc init

in the same way we do git init when initializing git for our projects. It produces the output shown below.

The next thing we need to do is set up a DVC remote (think of a DVC remote as the location where the tracked data or models are actually stored). It works just like git, where your repository remote is where your code actually lives. We have to do the same with DVC. You can use almost any cloud storage, e.g. Amazon S3, Google Cloud Storage, Azure, or even Google Drive.

For the sake of simplicity, I am going to create a new folder on Google Drive and use it as the storage remote for my large files.

Once you have created the folder, open it and look at the URL.
From the URL we need to copy the ID of that specific folder; we pass this to DVC to tell it that this is the particular cloud storage where we want to keep our data. The same approach works for AWS S3 or GCP.

The ID is everything you see in the URL after “folders/”. Once we have the ID, we can set this remote as the default for DVC in our project.

The command is shown below and breaks down as:

dvc remote add -d <remote name> <cloud provider prefix>://<remote storage folder id>

Here -d flags the remote as the default, the remote name in our case is demo-remote, and the provider prefix is gdrive.
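Put together, with the remote name demo-remote used in this article, the full command looks like the following (the folder ID below is a made-up placeholder, not a real one):

```shell
# -d makes demo-remote the default remote for this project;
# the trailing string is the Google Drive folder ID copied from the URL.
dvc remote add -d demo-remote gdrive://0AbCdEfGhIjKlMnOpQrStUv
```

After this, any dvc push or dvc pull will use that Drive folder unless another remote is named explicitly.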

Now that DVC is initialized and knows which location to use as remote storage for tracked files and models, the next step is telling it which files to track. We don’t have any such files yet, so I am going to run generate_data.py to generate a fairly large CSV and tell DVC to track it.
Before moving forward, let’s commit all the progress we have made so far:


git add .
git commit -m "Initial Push"
git push

On the initial push, these are the files that are created in our current repository.

We can see the .dvcignore and .dvc files are now present in our repo, along with generate_data.py. Next, I run generate_data.py, which gives me the following:

As we can see, we now have a 96 MB file in our repository. For the sake of this tutorial, let’s assume this is the data/model file we need to track. Again, I am only running generate_data.py to create a large file; in your repo you already have one, so you can tell DVC to track it directly.
In git, we do “git add <file>”.
In DVC, we do:

dvc add <file>

So running “dvc add data.csv” gives me the following output.
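As a side note, dvc add writes a small metafile, data.csv.dvc, which git tracks in place of the data itself. Its contents look roughly like this (the hash and size below are placeholders):

```yaml
outs:
- md5: a1b2c3d4e5f67890a1b2c3d4e5f67890
  size: 100663296
  path: data.csv
```

This is how git and DVC stay in sync: git versions the tiny metafile, and DVC uses the hash inside it to fetch the matching data from the remote.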

This essentially tells DVC which files it needs to track. Let’s save DVC’s tracking files in our repo by committing and pushing them with git.

Now our DVC and git repo are all set. Just like we pushed all our code changes with “git push”, we run “dvc push” to push all of our data changes.

As you can see, our data is now being uploaded, and that’s it. We can verify this in our Google Drive folder.

From now on, every time we change the data, all we need to do is dvc push, and the data is versioned with each push alongside our branch. The amazing thing is that anyone who clones our repository only needs to run dvc pull to get all the data. Just like git branches and commits, we will have versions of our data or models, and we can go back to any version we want.
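A typical round trip, sketched below under the assumption that data.csv is the tracked file, looks like this:

```shell
# Update the data, then record the new version
python generate_data.py
dvc add data.csv
git add data.csv.dvc
git commit -m "Update dataset"
dvc push                   # upload the new data version to the remote

# Later, to go back to an older version of the data:
git checkout <old-commit> -- data.csv.dvc
dvc checkout data.csv      # restore the matching data from cache/remote
```

Note that reverting is a git operation on the metafile followed by dvc checkout; DVC resolves the hash in the metafile back to the actual bytes.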

What are the scenarios where we need this?

Scenarios where DVC shines include:
1) Model versioning
2) Training data versioning
3) Keeping track of data or other large files
4) When a model file is too big to ship with the code in the git repo, we can store it via DVC, and anyone can fetch the model binaries or weights and use them.
5) Working with data/models in teams: DVC makes it easier and more formal to share data between team members while keeping a perfect record of it.
6) After model finalization, the binaries or weight files are usually kept separately from the actual repo; DVC keeps this consistent.

Which cloud / local platforms can we use with DVC as remote?

  1. Amazon S3
  2. Azure
  3. SSH
  4. HTTP
  5. Google Drive
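Configuring any of these is the same one-liner with a different URL scheme; the bucket, container, and host names below are placeholders:

```shell
# Each remote type uses the same command with a different URL scheme.
# Add -d to whichever one you want as the project default.
dvc remote add s3remote    s3://my-bucket/dvc-store
dvc remote add azremote    azure://my-container/dvc-store
dvc remote add sshremote   ssh://user@example.com/path/to/dvc-store
dvc remote add httpremote  https://example.com/dvc-store
dvc remote add gdremote    gdrive://<folder-id>
```

Some remotes need extra credentials configured (e.g. AWS keys or an Azure connection string), but the add command itself is uniform.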

Conclusion

DVC works alongside git, is a core component of continuous machine learning with CI/CD tools, and is a first and essential step in the MLOps process. What git does for code, DVC does for data: it tracks and maintains changes and lets you retrieve any version later. Together, the two tools also keep the repo clean; with code separated from data (which is often huge and can carry security concerns), we get better control over the whole process.

If you want to learn more and are eager to dive deeper, check out the DVC documentation or the DVC repository on GitHub.

Feel free to play around with my repository. I hope you liked the article, and if you did, please leave a clap below.

Till next time folks :)
