How to Version Control Your Machine Learning Code and Datasets

Bagus Awienandra
Published in Analytics Vidhya
6 min read, Feb 28, 2021

Version Control

How many millions of lines of code does it take to write Windows?

Have you ever wondered how big and complex an operating system or a search engine is? Windows XP consists of about 40 million lines of code, Facebook of 62 million lines, and Google of 2 billion lines [1]. These numbers are huge, and you may be wondering, “How can their engineers collaborate to maintain this code and make sure that every engineer in the company has the same version of it at the same time?” These big companies have thousands of engineers maintaining these huge systems. The answer is that they need version control software.

Using Git

Github Workflow

Nowadays, we have very stable and mature version control software such as Git. Other tools serve the same purpose, such as SVN and Mercurial, but in this article I will only cover Git because of its popularity. Using Git on your local machine is very simple: you can start versioning your code after a few initialization steps and after learning a handful of commands. The interesting part is that several providers, such as GitHub, GitLab, and Bitbucket, can host your Git repositories in the cloud. They let you back up your code and its version history in their cloud service for free, and you can invite others to collaborate on your project. As the image above shows, the Git workflow is very simple. On your local computer, you add your code to Git and then commit it. After all of the changes are made, you push them to the remote server (in this case, GitHub's server), and other engineers can pull them and start collaborating [2]. I will give you a short tutorial to start your Git journey.

Installing Git

First, download Git from https://git-scm.com/downloads. After the installation, use this command to check that Git is installed on your computer.

$ git --version

Before you start using Git, you may also need to set your Git user name and email.

$ git config --global user.email "you@example.com"
$ git config --global user.name "Your Name"

Create a new folder for your repository (a repository is a place for your source code) and initialize the version control inside your repository.

$ mkdir your_repository
$ cd your_repository
$ git init

Create your first commit

After the initialization, you can create a new file (for example, a Python file), add it to Git, and commit it. After that, you can view the history with the log command.

$ touch hello_world.py
$ git add .
$ git commit -m "My first commit"
$ git log
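The push and pull half of the workflow described earlier can be tried without a GitHub account. The sketch below is only an illustration under a few assumptions of mine: a bare repository in a temporary directory stands in for the GitHub server, and the folder names `server.git`, `alice`, and `bob` are hypothetical. With a real project you would use the repository URL from your provider instead of the local path.

```shell
# Scratch area so nothing touches a real project.
work=$(mktemp -d)
cd "$work"

# A bare repository plays the role of the remote server (assumption:
# it stands in for GitHub, which would give you a URL instead).
git init --quiet --bare server.git

# First engineer: create a repository, commit, and push.
git init --quiet alice
cd alice
git config user.email "you@example.com"
git config user.name "Your Name"
touch hello_world.py
git add hello_world.py
git commit --quiet -m "My first commit"
git remote add origin "$work/server.git"
git push --quiet -u origin HEAD

# Second engineer: clone the project from the server.
cd "$work"
git clone --quiet server.git bob

# First engineer makes another change and pushes it.
cd "$work/alice"
echo 'print("Hello, world")' > hello_world.py
git commit --quiet -am "Print a greeting"
git push --quiet

# Second engineer pulls and now has both commits.
cd "$work/bob"
git pull --quiet
git log --oneline
```

After the last command, both folders contain the same two commits, which is the whole point of the workflow: everyone ends up with the same history.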

Step up your level

Git is very useful and is currently considered an essential tool for machine learning engineers, software engineers, and data scientists. But as we dig deeper into Git, we find several limitations, and one of them is handling large files. Git was designed to handle source code files, which are usually under 1 MB. In a machine learning project, our datasets will be much bigger than that, and they are quite troublesome for Git to handle. Git has an extension called Git LFS for handling large files, but I will introduce you to another solution called DVC (Data Version Control).
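To get a feel for which files in a project are likely to give Git trouble, you can list everything above a size threshold. This is only a rough sketch: the 1 MiB cut-off mirrors the figure mentioned above and is not a hard Git limit, the file names are made up for the demo, and the `k` size suffix is an extension supported by GNU and BSD `find` rather than strict POSIX.

```shell
# Scratch directory with one small "source" file and one large "dataset".
demo=$(mktemp -d)
cd "$demo"
printf 'print("hello")\n' > hello_world.py
# 2 MiB of zero bytes stands in for a real dataset.
dd if=/dev/zero of=dataset.csv bs=1024 count=2048 2>/dev/null

# List regular files larger than 1 MiB, skipping Git's own directory.
large_files=$(find . -type f -size +1024k ! -path './.git/*')
echo "$large_files"
```

Files that show up in this list are candidates for a tool built for large data rather than for plain Git.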

Using DVC (Data Version Control)

Data Version Control Workflow

DVC, or Data Version Control, is a Git-like version control tool that uses Git as its backbone to version datasets [3]. The image above shows the DVC workflow. First, we add the files to DVC, and DVC creates metadata for those files. This metadata is committed to Git and is used for data versioning. An interesting feature of DVC is that we can use cloud storage such as S3, Azure, Google Cloud, and even Google Drive as our data repository (in this implementation, we will keep our source code on GitHub and our large data in Google Drive). As with Git, I will give you a very simple tutorial for getting started with DVC locally.

Installing DVC

You can download DVC from https://dvc.org. If you are a macOS or Linux user, you can install it through brew or apt. After the installation finishes, make sure it works using the commands below.

$ brew install dvc
$ dvc --version

After the installation is complete, we will use our “your_repository” repository for this tutorial. Change the directory to the “your_repository” folder from the Using Git tutorial and then initialize DVC.

$ dvc init

After that you can commit the changes using git.

$ git status
Changes to be committed:
new file: .dvc/.gitignore
new file: .dvc/config
...
$ git add .
$ git commit -m "Initialize DVC"

The next step is to create or copy any large dataset files into your “your_repository” folder. (In this example, I will create an empty CSV file.)

$ touch dataset.csv
$ dvc add dataset.csv
$ git add .
$ git commit -m "Add dataset"

Congratulations, you have now added version control to your datasets.

Add remote cloud storage

Storing the versions of our dataset only on our local computer is not convenient. Our project will become bigger and bigger, and we will need to share our dataset with other engineers. We will add remote cloud storage to save our data in the cloud so that others can access and pull it. First of all, create a new folder inside your Google Drive and copy the folder ID from the URL.

$ dvc remote add -f --default myremote gdrive://<your folder id>
$ git add .dvc/config
$ git commit -m "Configure remote storage"

Once the config file is committed, you can push the datasets to your Google Drive storage with dvc push. Your data will appear in the Drive folder, and you can pull it at any time and from anywhere with dvc pull.
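If you want to rehearse the push and pull steps before connecting Google Drive, a plain local directory can act as the remote. This is a sketch under stated assumptions: DVC is installed, the directory `storage` is my stand-in for the Drive folder, and `project` and `teammate` are hypothetical folder names. With Google Drive you would keep the gdrive:// address from above instead of the local path.

```shell
# Scratch area; a plain directory stands in for the Google Drive folder.
work=$(mktemp -d)
cd "$work"
mkdir storage

# Project with one tracked dataset and a default remote.
git init --quiet project
cd project
git config user.email "you@example.com"
git config user.name "Your Name"
dvc init
printf 'id,label\n1,cat\n' > dataset.csv
dvc add dataset.csv
dvc remote add --default myremote "$work/storage"
git add .
git commit --quiet -m "Add dataset and configure remote storage"

# Upload the data tracked by DVC to the remote.
dvc push

# A teammate clones the code: Git brings the metadata, not the data.
cd "$work"
git clone --quiet project teammate
cd teammate
ls

# Download the data that the metadata points to.
dvc pull
cat dataset.csv
```

The `ls` right after the clone shows `dataset.csv.dvc` but no `dataset.csv`; the data file only arrives with `dvc pull`.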

Final words

Your machine learning project will grow, and you will invite other engineers to contribute and collaborate. Git and DVC are great tools for making sure that everyone on the project has the same state of code and datasets at the same time.

References

[1]: https://www.informationisbeautiful.net/visualizations/million-lines-of-code

[2]: https://git-scm.com/doc

[3]: https://dvc.org
