Conceptual Overview of Data Version Control (DVC)

Jeevitha M
Published in featurepreneur
6 min read · Apr 3, 2022

In software engineering, we usually version our code with Git. Traditional version control systems built for regular software projects aren’t quite sufficient for machine learning, though, because ML projects need to track datasets and the resulting models along with the code itself. Without a data version control system, the structure of datasets and models in a machine learning project can quickly become complex. A data version control system addresses this problem.

DVC is a command-line tool written in Python. It mimics Git commands and workflows so that users can quickly incorporate it into their regular Git practice. It helps data scientists manage, track, and version data and models, as well as run reproducible experiments.

In a nutshell, DVC is Git for ML. There are a number of alternatives, but DVC is still worth choosing for its ease of use, its Git-like functionality, and the possibility of integrating it with ML pipelines as part of a larger system.

While Git stores and versions code, DVC does the same for data and model files. For each dataset or model it tracks, DVC creates a lightweight file with the .dvc extension that points to the actual data. This file is meant to be stored on GitHub along with the rest of the repository so that the data associated with it can be retrieved later.

To prepare your workspace, you’ll take the following steps:

1. Create and activate a virtual environment, for example a conda environment named dvc.

conda create --name dvc python
conda activate dvc

2. Install DVC along with the Python libraries you plan to use.

python -m pip install dvc scikit-learn scikit-image pandas numpy

3. Clone a GitHub repository with all the code.

On the repository’s GitHub page, click Fork in the top right of the screen and select your own account in the window that pops up. Then clone the forked repository to your computer with the git clone command and move your command line into the repository folder:

git clone https://github.com/YourUsername/data-version-control
cd data-version-control

Replace YourUsername in the command above with your actual GitHub username.

The repository contains six folders, organized in the following structure:

data-version-control/
|
├── data/
│   ├── prepared/
│   └── raw/
|
├── metrics/
├── model/
└── src/
    ├── evaluate.py
    ├── prepare.py
    └── train.py

The src/ folder contains three Python scripts:

prepare.py
train.py
evaluate.py

4. Finally, download a dataset to work with.

A public image dataset, for example one from Kaggle or Imagenette, is a good choice; this walkthrough uses Imagenette. Download the dataset. It is structured in a particular way, with two main folders:

  1. train/ includes images used for training a model.
  2. val/ includes images used for validating a model.

The train/ and val/ folders are further divided into multiple subfolders. Each subfolder is named with a code that identifies one class of images, and it holds the images that belong to that class. For example, if you pick any two kinds of objects, the model will take an image and classify it as one or the other. This kind of problem, in which a model decides between two kinds of objects, is called binary classification.
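As a rough illustration of this folder layout, the following Python sketch walks two class subfolders and pairs every image with a label of 0 or 1. The function name label_images and the folder names in the usage note are hypothetical; they are not part of DVC or of the scripts in src/.

```python
from pathlib import Path


def label_images(split_dir: Path, class_folders: tuple[str, str]) -> list[tuple[str, int]]:
    """Return (image path, label) pairs: label 0 for the first folder, 1 for the second."""
    pairs = []
    for label, folder in enumerate(class_folders):
        # Sort so the result is the same on every run and every machine.
        for image in sorted((split_dir / folder).iterdir()):
            if image.is_file():
                pairs.append((str(image), label))
    return pairs
```

Calling label_images(Path("data/raw/train"), ("class_a", "class_b")) would produce the labeled file list for a binary classifier, where class_a and class_b stand in for two of the coded subfolder names in your copy of the dataset.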

Copy the train/ and val/ folders into your new repository, in the data/raw/ folder. Alternatively, you can download the dataset from the command line with curl:

curl -O https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz

You can then extract the archive and move the train/ and val/ folders into data/raw/. Finally, remove the archive and the extracted folder.
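The extract, move, and clean-up steps can be sketched in Python with the standard library alone. This is a minimal sketch, not part of the repository: the function name unpack_dataset is hypothetical, and it assumes the archive unpacks into a single top-level folder named imagenette2-160 that contains train/ and val/.

```python
import shutil
import tarfile
from pathlib import Path


def unpack_dataset(archive: Path, raw_dir: Path, top_level: str = "imagenette2-160") -> None:
    """Extract the archive, move train/ and val/ into raw_dir, then clean up."""
    extract_root = archive.parent
    with tarfile.open(archive, "r:gz") as tar:
        # Only extract archives that come from a source you trust.
        tar.extractall(extract_root)

    # Assumption: everything was unpacked into one folder named `top_level`.
    extracted = extract_root / top_level
    raw_dir.mkdir(parents=True, exist_ok=True)
    for split in ("train", "val"):
        shutil.move(str(extracted / split), str(raw_dir / split))

    shutil.rmtree(extracted)  # remove what is left of the extracted folder
    archive.unlink()          # remove the downloaded archive
```

From the root of the repository, unpack_dataset(Path("imagenette2-160.tgz"), Path("data/raw")) would leave the images in data/raw/train/ and data/raw/val/.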

ALL SET! Your system is now ready for DVC.

HOW DOES DVC WORK WITH GIT TO MANAGE CODE AND DATA?

You’ll use the dataset downloaded from Imagenette to practice the DVC basics. Let’s start by creating a branch for the first experiment:

git checkout -b "xyz"

Here, git checkout switches the current branch, and the -b flag tells Git that the branch doesn’t exist yet and should be created. Next, you need to initialize DVC. This creates a .dvc folder that plays a role similar to the .git folder:

dvc init

You need some kind of remote storage for the data and model files controlled by DVC. This can be as simple as another folder on your system. Create a folder somewhere on your system outside the data-version-control/ repository and call it dvc_remote.

Now come back to your data-version-control/ repository and tell DVC where the remote storage is on your system:

$ dvc remote add -d remote_storage path/to/your/dvc_remote

DVC supports many cloud-based storage systems, such as AWS S3 buckets, Google Cloud Storage, and Microsoft Azure Blob Storage. Your repository is now initialized and ready for work. You’ll cover three basic actions:

  1. Tracking files
  2. Uploading files
  3. Downloading files

The basic rule of thumb you’ll follow is that small files go to GitHub, and large files go to DVC remote storage.

Tracking Files

Images are considered large files, especially if they’re collected into datasets with hundreds or thousands of files. The dvc add command, run once for data/raw/train and once for data/raw/val, puts these two folders under DVC control. Here’s what DVC does under the hood:

  1. Adds your train/ and val/ folders to .gitignore
  2. Creates two files with the .dvc extension, train.dvc and val.dvc
  3. Copies the train/ and val/ folders to a staging area
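To make the three steps above more concrete, here is a toy Python model of them. It is a sketch under loud assumptions, not DVC’s actual implementation: the function toy_add, the JSON pointer format, and the flat cache layout are all hypothetical simplifications (real .dvc files are YAML, and DVC manages its own cache layout).

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file's contents; the hash is used as the file's name in the cache."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(folder: Path, cache_dir: Path) -> Path:
    """Toy model of tracking a folder: cache the data, write a pointer, ignore the folder."""
    # Copy every file into the cache under its content hash (identical files are stored once).
    cache_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for item in sorted(p for p in folder.rglob("*") if p.is_file()):
        checksum = file_md5(item)
        cached = cache_dir / checksum
        if not cached.exists():
            shutil.copy2(item, cached)
        manifest.append({"relpath": item.relative_to(folder).as_posix(), "md5": checksum})

    # Write a small pointer file next to the folder (hypothetical JSON format).
    pointer = folder.parent / (folder.name + ".dvc")
    pointer.write_text(json.dumps({"path": folder.name, "files": manifest}, indent=2))

    # List the folder in .gitignore so Git never sees the large files.
    gitignore = folder.parent / ".gitignore"
    entry = "/" + folder.name
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    if entry not in existing:
        gitignore.write_text("\n".join(existing + [entry]) + "\n")
    return pointer
```

The design point the sketch tries to show is that only the small pointer file and the .gitignore entry need to live in Git, while the bulky data lives in a cache addressed by content hash.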

When you run dvc add train/, the folder with large files goes under DVC control, and the small .dvc and .gitignore files go under Git control. The train/ folder is also copied into DVC’s staging area, or cache. To put the small files under Git control, stage them with git add:

git add --all

The --all switch adds all files that are visible to Git to the staging area.

Now all the files are under the control of their respective version control systems.

Uploading Files

To upload your files from the cache to the remote, use the push command:

$ dvc push

DVC will look through all your repository folders to find .dvc files. As mentioned, these files tell DVC what data needs to be backed up, and DVC will copy that data from the cache to remote storage.

Downloading Files

To try out downloading, first remove the entire val/ folder, but make sure that the val.dvc file doesn’t get removed:

$ rm -rf data/raw/val

This will delete the data/raw/val/ folder from your repository, but the folder is still safely stored in your cache and the remote storage. You can get it back at any time.

To get your data back from the cache, use the dvc checkout command:

$ dvc checkout data/raw/val.dvc

Your data/raw/val/ folder has been restored. If you want DVC to search through your whole repository and check out everything that’s missing, then use dvc checkout with no additional arguments.

When you clone your GitHub repository on a new machine, the cache will be empty. The fetch command gets the contents of the remote storage into the cache:

$ dvc fetch data/raw/val.dvc

Or you can use just dvc fetch to get the data for all DVC files in the repository. Once the data is in your cache, check it out to the repository with dvc checkout. You can perform both fetch and checkout with a single command, dvc pull:

$ dvc pull

dvc pull executes dvc fetch followed by dvc checkout. It copies your data from the remote to the cache and into your repository in a single sweep. These commands roughly mimic what Git does, since Git also has fetch, checkout, and pull commands.
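The flow of data between workspace, cache, and remote can be sketched as a toy Python model. Everything here is an illustrative assumption rather than DVC’s real code: the toy_ functions are hypothetical, the cache and remote are plain folders of files named by hash, and the pointer file is a hypothetical JSON document with a path and a list of relpath and md5 entries. Unlike the real dvc push, this toy copies every object in the cache instead of only those referenced by .dvc files.

```python
import json
import shutil
from pathlib import Path


def sync_objects(source: Path, target: Path) -> int:
    """Copy every object found in source but missing from target; return the number copied."""
    target.mkdir(parents=True, exist_ok=True)
    copied = 0
    for obj in source.iterdir():
        if obj.is_file() and not (target / obj.name).exists():
            shutil.copy2(obj, target / obj.name)
            copied += 1
    return copied


def toy_push(cache: Path, remote: Path) -> int:
    """Upload: cache -> remote."""
    return sync_objects(cache, remote)


def toy_fetch(cache: Path, remote: Path) -> int:
    """Download: remote -> cache (the workspace is not touched)."""
    return sync_objects(remote, cache)


def toy_checkout(pointer: Path, cache: Path) -> None:
    """Rebuild the tracked folder in the workspace from objects in the cache."""
    meta = json.loads(pointer.read_text())
    folder = pointer.parent / meta["path"]
    for entry in meta["files"]:
        destination = folder / entry["relpath"]
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cache / entry["md5"], destination)


def toy_pull(pointer: Path, cache: Path, remote: Path) -> None:
    """Fetch followed by checkout, in one call."""
    toy_fetch(cache, remote)
    toy_checkout(pointer, cache)
```

In this model, a fresh clone corresponds to having the pointer file but an empty cache, which is why a fetch has to happen before a checkout can restore anything.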

CONCLUSION

By the end of this article, you should have a clear picture of DVC and how it is used, including with cloud storage platforms. I hope to reach you all again with new and interesting ML and DS topics.


Jeevitha M
featurepreneur

AWS Cloud Captain Cohort 3 - India Region, interested in Data Analytics, Big Data, Cloud Computing [AWS/Azure]. A newbie in security. #Womenintech #Research