Conceptual Overview of Data Version Control (DVC)
In software engineering, we usually version our code with Git, but traditional version control systems built for regular software projects aren't sufficient for machine learning: alongside the code itself, we also need to track datasets and the resulting models. Without a data version control system, keeping the many versions of datasets and models organized in a machine learning project quickly becomes complex. To overcome this problem, one has to adopt a data version control system.
DVC is a command-line tool written in Python. It mimics Git commands and workflows so that users can quickly incorporate it into their regular Git practice. It helps data scientists manage, track, and version data and models, as well as run reproducible experiments.
In a nutshell, DVC is Git for ML. There are a number of alternatives, but DVC is worth choosing for its ease of use, its Git-like functionality, and the possibility of integrating it with ML pipelines as part of a larger system.
While Git stores versions of your code, DVC does the same for data and model files. For each tracked dataset or model stored in remote storage, DVC creates a lightweight file with the .dvc extension. This file is stored on GitHub along with the rest of the repository, and DVC uses it to locate and retrieve the data it refers to.
To prepare your workspace, you’ll take the following steps:
1. Create and activate a virtual environment, for example with conda:
conda create -n dvc python=3 -y
conda activate dvc
2. Install DVC and the prerequisite Python libraries:
python -m pip install dvc scikit-learn scikit-image pandas numpy
3. Clone a GitHub repository with all the code.
In the top right of the repository page, click Fork and select your account in the window that pops up. Then clone the forked repository to your computer with the git clone command and position your command line inside the repository folder:
git clone https://github.com/YourUsername/data-version-control
cd data-version-control
Replace YourUsername in the command above with your actual GitHub username.
The repository is organized into the following folder structure:
data-version-control/
├── data/
│   ├── prepared/
│   └── raw/
├── metrics/
├── model/
└── src/
    ├── evaluate.py
    ├── prepare.py
    └── train.py
The src/ folder contains three scripts:
- prepare.py
- train.py
- evaluate.py
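If you'd rather build this skeleton by hand instead of forking the repository, the tree above can be recreated with a few commands (the script files are created empty here; the forked repository contains the real code):

```shell
# Recreate the project skeleton shown in the tree above.
mkdir -p data-version-control/data/prepared \
         data-version-control/data/raw \
         data-version-control/metrics \
         data-version-control/model \
         data-version-control/src
# Empty placeholders for the three scripts.
touch data-version-control/src/evaluate.py \
      data-version-control/src/prepare.py \
      data-version-control/src/train.py
```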
4. The final step is to get the dataset.
This tutorial uses the Imagenette dataset (a fastai subset of ImageNet, also available on Kaggle). Download it. The dataset is structured in a particular way, with two main folders:
- train/ includes images used for training a model.
- val/ includes images used for validating a model.
The train/ and val/ folders are further divided into multiple subfolders, one per class, each holding the images for that class. For example, if you pick any two kinds of objects, the model will accept images and classify each one as belonging to one kind or the other. This kind of problem, in which a model decides between two kinds of objects, is called binary classification.
Copy the train/ and val/ folders into your new repository, in the data/raw/ folder. Alternatively, download the archive directly with the curl command:
curl -O https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
You can then extract the dataset, move the train/ and val/ folders into data/raw/, and finally remove the archive and the extracted folder.
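The extract-and-arrange step can be sketched as below. To keep the sketch runnable without the large download, it fabricates a tiny stand-in archive with the same layout; in practice you would use the imagenette2-160.tgz file fetched by curl:

```shell
# Stand-in for the real download: build a tiny archive with the same layout.
mkdir -p imagenette2-160/train imagenette2-160/val
tar -czf imagenette2-160.tgz imagenette2-160
rm -rf imagenette2-160

# The actual workflow: extract, move train/ and val/ into data/raw/, clean up.
tar -xzf imagenette2-160.tgz
mkdir -p data/raw
cp -r imagenette2-160/train imagenette2-160/val data/raw/
rm -rf imagenette2-160 imagenette2-160.tgz
```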
All set! Your system is now ready for DVC.
HOW DOES DVC WORK WITH GIT TO MANAGE CODE AND DATA?
You'll use the Imagenette dataset you just downloaded to practice the DVC basics. Let's start by creating a branch for our first experiment:
git checkout -b "xyz"
In this command, git checkout changes the current branch, and the -b switch tells Git to create that branch. Next, you need to initialize DVC; this creates a .dvc folder, similar to the .git folder:
dvc init
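The branch-creation step can be tried on a throwaway repository (the demo-repo name is made up for illustration; in practice you run this inside your data-version-control clone):

```shell
# Throwaway repository to demonstrate branch creation.
git init demo-repo
cd demo-repo
git checkout -b "xyz"             # create the branch and switch to it
git symbolic-ref --short HEAD     # prints the active branch: xyz
```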
You need some kind of remote storage for the data and model files controlled by DVC. This can be as simple as another folder on your system. Create a folder somewhere on your system outside the data-version-control/ repository and call it dvc_remote.
Now come back to your data-version-control/ repository and tell DVC where the remote storage is on your system:
$ dvc remote add -d remote_storage path/to/your/dvc_remote
DVC supports many cloud-based storage systems, such as AWS S3 buckets, Google Cloud Storage, and Microsoft Azure Blob Storage. Your repository is now initialized and ready for work. You’ll cover three basic actions:
- Tracking files
- Uploading files
- Downloading files
The basic rule of thumb you'll follow is that small files go to GitHub, and large files go to DVC remote storage.
Tracking Files
Images are considered large files, especially when they're collected into datasets with hundreds or thousands of files. The dvc add command puts the train/ and val/ folders under DVC control:
dvc add data/raw/train data/raw/val
Here's what DVC does under the hood:
- Adds your train/ and val/ folders to .gitignore
- Creates two files with the .dvc extension, train.dvc and val.dvc
- Copies the train/ and val/ folders to a staging area
When you run dvc add train/, the folder with large files goes under DVC control, and the small .dvc and .gitignore files go under Git control. The train/ folder also goes into the staging area, or cache. To put the small files under Git control, stage them:
git add --all
The --all switch adds all files that are visible to Git to the staging area.
Now all the files are under the control of their respective version control systems:
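The cache mentioned above is content-addressed: DVC stores each file under a path derived from the MD5 hash of its contents, so identical data is stored only once. Here is a rough illustration of the naming scheme, not DVC's actual code (demo-cache is a made-up folder name standing in for .dvc/cache):

```shell
# Mimic the cache layout: <cache>/<first 2 hex chars of md5>/<remaining chars>
printf 'hello' > sample.txt
hash=$(md5sum sample.txt | cut -d' ' -f1)   # md5 of the file contents
dir=$(echo "$hash" | cut -c1-2)
rest=$(echo "$hash" | cut -c3-)
mkdir -p "demo-cache/$dir"
cp sample.txt "demo-cache/$dir/$rest"
echo "demo-cache/$dir/$rest"   # prints: demo-cache/5d/41402abc4b2a76b9719d911017c592
```

Because the path depends only on the content, re-adding an unchanged file never duplicates it in the cache.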
Uploading Files
To upload your files from the cache to the remote, use the push command:
$ dvc push
DVC will look through all your repository folders to find .dvc files. As mentioned, these files tell DVC what data needs to be backed up, and DVC will copy it from the cache to remote storage.
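For reference, a .dvc file is just a small, human-readable YAML snippet. A hypothetical train.dvc might look roughly like this (the hash value here is made up, and the exact fields vary between DVC versions):

```yaml
outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
  path: train
```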
Downloading Files
You can remove the entire val/ folder, but make sure that the .dvc file doesn't get removed:
$ rm -rf data/raw/val
This will delete the data/raw/val/ folder from your repository, but the folder is still safely stored in your cache and the remote storage. You can get it back at any time.
To get your data back from the cache, use the dvc checkout command:
$ dvc checkout data/raw/val.dvc
Your data/raw/val/ folder has been restored. If you want DVC to search through your whole repository and check out everything that's missing, use dvc checkout with no additional arguments.
When you clone your GitHub repository on a new machine, the cache will be empty. The fetch command gets the contents of the remote storage into the cache:
$ dvc fetch data/raw/val.dvc
Or you can use just dvc fetch to get the data for all DVC files in the repository. Once the data is in your cache, check it out to the repository with dvc checkout. You can perform both fetch and checkout with a single command, dvc pull:
$ dvc pull
dvc pull executes dvc fetch followed by dvc checkout. It copies your data from the remote to the cache and into your repository in a single sweep. These commands roughly mimic what Git does, since Git also has fetch, checkout, and pull commands.
CONCLUSION
By the end of this article, you should have a clear picture of DVC and how it runs alongside Git, whether your remote storage is a local folder or a cloud platform. Hope to reach you all again with new and interesting ML and DS topics.