Published in Geek Culture

Data Version Control (DVC) with Google Cloud Storage and Python for ML

Data Version Control is an emerging practice that enables faster machine learning iterations while still tracking the changes in data and models.

Introduction

A machine learning project lifecycle differs from a normal software lifecycle. Ordinary software does not depend heavily on data, whereas in machine learning each model depends on its underlying data, and the model behaves differently when that data changes.

In simple terms: data changes > ML code needs recalibration > model changes

So there is an absolute need to track not just the code but also the data used to build the model.
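The chain above can be illustrated with a toy example: identical training code produces different models when the data changes (train_mean_model is a made-up illustrative name, not part of any library):

```python
def train_mean_model(data):
    """A toy 'model' that simply predicts the mean of its training data.
    Retraining on changed data yields a different model even though the
    code is identical, which is why data must be versioned alongside code."""
    mean = sum(data) / len(data)
    return lambda: mean

model_v1 = train_mean_model([1, 2, 3])       # predicts 2.0
model_v2 = train_mean_model([1, 2, 3, 10])   # data changed; predicts 4.0
```

The training function never changed, yet the two models disagree; only versioning the data explains why.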

Welcome to Data Version Control

In this article, we will look at a tool called dvc, which is very similar to git but for data. For those not familiar with git: git is an open-source version control system that tracks changes to a set of files and keeps a history of those changes. It is widely used in the software development lifecycle and is one of the components of DevOps. (We will look at the similar concept of MLOps later in a different article.)

The core concept is that we track file changes and store them in a remote location under an identifier hash using push commands; later, we can fetch whichever version of the code or data we need.
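This identifier hash comes from content hashing. A minimal sketch of the idea in Python (file_md5 is my own helper, a simplification of what dvc actually does internally):

```python
import hashlib

def file_md5(path):
    """Compute the md5 hash of a file in chunks, similar to how dvc
    fingerprints each tracked file. Identical content always yields the
    same hash, so unchanged data is never stored twice; any edit to the
    file produces a new hash and hence a new stored version."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

The hash acts as both a version identifier and a storage key for the remote copy.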

For code changes with git, we have options like GitHub, Bitbucket, GitLab, etc. Similarly, for data changes (we are talking about huge files), we need remote storage to hold them.

With this approach, we separate our data from the code, and the data can be fetched on demand.

dvc can store data remotely on a variety of platforms, including (but not limited to) AWS, Google Drive, a local folder, and SSH. In this article, I'm going to use Google Cloud Storage, because I had trouble doing it end-to-end (including changing the ML code) and couldn't find all the instructions in a single place.

Prerequisite

You have installed the Google Cloud SDK, and your credentials (either end-user or service account) are saved to a location referenced by the GOOGLE_APPLICATION_CREDENTIALS environment variable. This process is not covered in this article.
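Before going further, it can help to verify that the credentials are actually in place, since a later dvc push will fail without them. A minimal sketch (credentials_ok is my own illustrative helper):

```python
import os

def credentials_ok(env=os.environ):
    """Return True when GOOGLE_APPLICATION_CREDENTIALS points at an
    existing key file. dvc push to Google Cloud Storage fails without
    a valid credentials file, so checking early saves a failed push."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    return bool(path) and os.path.isfile(path)
```

Passing a dict instead of os.environ makes the check easy to test in isolation.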

Let me demonstrate with one of my git repositories.

git clone https://github.com/avinashknmr/predict-game-result.git
cd predict-game-result

The folder structure is as below

predict-game-result
├── Pipfile
├── Pipfile.lock
├── compare.py
├── data
│   ├── game_results.csv
│   ├── player_details.csv
│   ├── player_type.csv
│   ├── prediction.csv
│   ├── team_data.csv
│   └── upcoming_fixtures.csv
├── download.py
└── predict.py

Next, we will set up DVC and add files to it. First, let's install dvc using your preferred installer. I use pipenv, but the command is the same for pip. On macOS you can also install via Homebrew with brew install dvc, but if you are using dvc within a virtual environment, you need the [gs] extra and must install it inside that environment as well. For detailed documentation, please visit dvc.org.

pipenv install "dvc[gs]"
# or
pip install "dvc[gs]"

The [gs] extra is required to use Google Cloud Storage in this demo. It is also assumed that GOOGLE_APPLICATION_CREDENTIALS is configured; otherwise, the later step of pushing data to Google Cloud Storage will fail.

Let's initialize dvc. Many dvc commands are deliberately similar to their git counterparts, with a few differences, as dvc is designed to work alongside git.

dvc init

Let’s add Google Cloud Storage as default remote storage to this repo.

# dvc remote add -d remote_storage gs://<bucket-name>/<folder-name>
dvc remote add -d remote_storage gs://ai-ml-dvc/predict-game-result

The dvc config file, .dvc/config, now looks like this:

[core]
    analytics = false
    remote = remote_storage
['remote "remote_storage"']
    url = gs://ai-ml-dvc/predict-game-result

Next, let's add all the data files to Data Version Control:

dvc add data/*.csv

At this point, .csv.dvc files are created under the data folder; these track the data versions. Let's look at the contents of one .csv.dvc file. It contains an md5 hash that links to the original file. Later, when we push data to remote_storage, the file is saved under this hash name. The md5 hash changes whenever the file changes, and the .csv.dvc file itself is tracked via git.

outs:
- md5: 1df2045c428b6bbb05a4218328ee8bff
  size: 225002
  path: game_results.csv
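For illustration, the fields of such a file can be pulled out with a few lines of Python (read_dvc_meta is my own toy helper; real .dvc files are YAML and should be parsed with a proper YAML library):

```python
def read_dvc_meta(text):
    """Extract the md5, size and path fields from the contents of a
    .dvc file. A deliberately minimal parser for illustration only."""
    meta = {}
    for line in text.splitlines():
        # Drop indentation and the YAML list marker ("- md5: ..." -> "md5: ...")
        line = line.strip().lstrip("- ")
        key, _, value = line.partition(":")
        if key in ("md5", "size", "path"):
            meta[key] = value.strip()
    return meta
```

The md5 field is exactly the key under which the pushed copy is stored remotely.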

Let’s add all unstaged changes

git add .

Commit all the staged changes

git commit -m "first commit with dvc files"

Now let’s push the data files to Google Cloud Storage with —

dvc push

Next, we will push the changes to the git repository so that changes are tracked and saved.

git push -u origin develop

Now check the Cloud Storage bucket: for each file you will find a folder named after the first two characters of its md5 hash, with the remaining characters as the object name inside that folder. The next time a file changes, its md5 hash changes, so a different folder is created.
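This hash-to-path layout can be sketched as a small helper (remote_path is my own illustrative function, defaulting to the bucket used in this article):

```python
def remote_path(md5_hash, remote_url="gs://ai-ml-dvc/predict-game-result"):
    """Map a file's md5 hash to its location in remote storage: the
    first two hash characters become a folder, the rest the object name."""
    return f"{remote_url}/{md5_hash[:2]}/{md5_hash[2:]}"

remote_path("1df2045c428b6bbb05a4218328ee8bff")
# → 'gs://ai-ml-dvc/predict-game-result/1d/f2045c428b6bbb05a4218328ee8bff'
```

This matches the URL that dvc.api.get_url returns for game_results.csv below.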

How to use data in our ML code?

Only the .csv.dvc files know the mapping from file names to hashes, so it is really important not to delete them; without them, we cannot retrieve the files from Cloud Storage.

Now, within our script, we import dvc.api for various operations; get_url is the most basic one, returning the URL of a tracked file.

>>> import dvc.api
>>> dvc.api.get_url('data/game_results.csv')
'gs://ai-ml-dvc/predict-game-result/1d/f2045c428b6bbb05a4218328ee8bff'
>>> data_path = dvc.api.get_url('data/game_results.csv')

The rest of the code remains unchanged; there is no need to modify it unless the file name changes.

>>> import pandas as pd
>>> df = pd.read_csv(data_path)
>>> df.shape
(125, 37)

Data versions are tracked by dvc, whereas the dvc config files and code versions are tracked by git.

This is an effective way to track data changes, and it works for model objects and metrics as well. It also keeps data separate from code while still tracking data changes, which is a prerequisite for machine learning projects.

Conclusion

dvc works alongside git and is a core component of continuous machine learning with CI/CD tools; it is the first and most important step in the MLOps process. What git does for code, dvc does for data: it tracks and maintains changes and later retrieves the versions we need. Together, these two tools keep the repository clean. With code separated from data (which can be huge and carry security concerns), we have better control over the whole process. We have looked at setting this up with one storage option (Google Cloud Storage), but dvc also supports AWS, Azure, SSH, HTTP, and even Google Drive for small projects.

Avinash Kanumuru
Data Scientist | Machine Learning Engineer | Data & Analytics Manager