Two years ago, when I started working on real-world data science projects, I saw that the biggest missing piece was a way to bind datasets to scripts, and scripts to training artifacts. Back then, my solution was to store all those connections in JSON configuration files and then call the scripts from them to build reproducible pipelines. And of course, I also had to write the scripts that pushed the datasets and artifacts to the GPU servers in the cloud. I no longer worry about any of this, because I use DVC to do all of it :).
In my experience, whether it is a real-world data science project or a data science competition, two components are key to success: API simplicity and reproducible pipelines. Since data science means experimenting a lot in a limited time frame, we need, first, simple machine learning tools and, second, reliable and reproducible machine learning pipelines. Thanks to tools like Keras, LightGBM, and fastai, we already have simple yet powerful tools for rapid model development. And thanks to DVC, we can build large projects with reproducible pipelines very easily.
The aim of this post is to mention some cool features of DVC and, for those who haven't heard of it yet, to get it added to their favorites list.
DVC simply helps you add datasets to your git repository and bind these datasets to the scripts or commands that use them, creating reproducible pipelines for your colleagues and your future self.
Adding a dataset is as easy as typing the command below:
dvc add path/to/dataset
I can easily add any large dataset to the repository this way, and DVC automatically makes git ignore it. Assuming I have already configured the cloud remote, I can also push the dataset to the cloud with this command:
dvc push path/to/dataset.dvc
DVC supports many remote storage services such as S3, Google Drive, Azure, SSH, etc. And since I pushed the dataset to the cloud, if I clone the project on another machine, I can download the dataset (and any other pushed artifacts) there with dvc pull.
How about creating reproducible pipelines? Well, I simply use the dvc run command for that. Whether it is a Python script or any other terminal command like unzip, I can use dvc run. For example, let's assume I downloaded the dataset as a zip file, so I first need to unzip it via:
dvc run -d dataset.zip -o dataset -f (project-root)/stages/dataset/unzip.dvc unzip -q dataset.zip -d dataset
Here -d stands for dependency: in order to reproduce this unzipping stage, DVC needs the dataset.zip file first. And -o is for output, meaning that DVC adds the dataset folder to the cache and makes git ignore it (as was also the case for the dvc add command above). And as you can probably guess, the path given with -f is the path to the DVC metafile. I can push the output(s) using this metafile:
dvc push stages/dataset/unzip.dvc
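To make the dependency idea behind -d a bit more concrete, here is a minimal Python sketch of how a tool could decide whether a stage needs to run again: hash every dependency and compare against the hashes saved after the last run. This is not DVC's actual implementation; the function names and the JSON record file are hypothetical, and the sketch assumes dependencies are plain files rather than folders.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_hashes(deps: list[Path]) -> dict[str, str]:
    """Map each dependency path to its current content hash."""
    return {str(dep): file_md5(dep) for dep in deps}


def stage_is_stale(deps: list[Path], record: Path) -> bool:
    """True if the stage never ran or any dependency changed since the last run."""
    if not record.exists():
        return True  # no saved hashes yet, so the stage has to run
    saved = json.loads(record.read_text())
    return saved != current_hashes(deps)


def save_record(deps: list[Path], record: Path) -> None:
    """Store the dependency hashes after a successful run (hypothetical record file)."""
    record.write_text(json.dumps(current_hashes(deps), indent=2))
```

With something like this, a stage whose dependencies are unchanged can be skipped, which is the behavior I rely on when reproducing a pipeline.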
Now let's assume I have a training script named train.py and want to add the training stage to the pipeline and push the artifacts to the remote storage. An example command would look something like this:
dvc run -d path/to/dataset -d train.py -o path/to/model.h5 -m path/to/metric1.json -f stages/train/model1.dvc python train.py --path path/to/dataset --save path/to/model.h5 --metric path/to/metric1.json
As you can see, I am storing all the connections in this command, and they are preserved in the stages/train/model1.dvc metafile. So whenever I want to reproduce the training, I can simply do:
dvc repro stages/train/model1.dvc
DVC checks the pipeline starting from the unzipping stage, since its output, the dataset folder, is defined as a dependency of the training stage. Of course, if I want to avoid running everything from scratch on other machines, I can always push the outputs via dvc push and then run dvc pull in the new clone.
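For reference, here is a rough sketch of what a train.py compatible with the dvc run command above could look like. Only the three flags (--path, --save, --metric) come from that command; everything else is an assumption for illustration. In particular, the train function is a placeholder that just counts the files in the dataset folder instead of fitting a model, and the n_files metric is a made-up name.

```python
from __future__ import annotations

import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the three flags used in the dvc run command."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--path", required=True, help="dataset folder (the -d dependency)")
    parser.add_argument("--save", required=True, help="model file to write (the -o output)")
    parser.add_argument("--metric", required=True, help="metric JSON to write (the -m output)")
    return parser.parse_args(argv)


def train(dataset_dir: Path) -> tuple[bytes, dict]:
    """Placeholder for real training: returns dummy model bytes and a metrics dict."""
    n_files = sum(1 for p in dataset_dir.rglob("*") if p.is_file())
    return b"model-placeholder", {"n_files": n_files}


def main(argv=None):
    args = parse_args(argv)
    model_bytes, metrics = train(Path(args.path))
    save_path, metric_path = Path(args.save), Path(args.metric)
    # Create the output folders so the stage also works on a fresh clone.
    save_path.parent.mkdir(parents=True, exist_ok=True)
    metric_path.parent.mkdir(parents=True, exist_ok=True)
    save_path.write_bytes(model_bytes)
    metric_path.write_text(json.dumps(metrics))
    return metrics

# When saved as train.py, finish the file with the usual
# `if __name__ == "__main__": main()` guard so the stage command can invoke it.
```

The important part is that the script reads only what is declared as a dependency and writes only what is declared as an output or metric, so the metafile describes the stage completely.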
Lastly, as you may have noticed, there is the -m argument in the dvc run example, which states that the output is a metric file. The cool thing about metric files in DVC is that I can run many experiments on different git branches and check the training scores on all branches with one command. This helps me see the big picture :) and merge only the well-scoring experiments into master.
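To show what such a comparison boils down to, here is a small Python sketch that ranks experiments by one score. DVC collects the metric files across branches by itself; this sketch instead assumes the metric JSON files have already been gathered locally, one per experiment, and both function names are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_metrics(metric_files: dict[str, Path]) -> dict[str, dict]:
    """Read one metric JSON file per experiment name."""
    return {name: json.loads(path.read_text()) for name, path in metric_files.items()}


def rank_experiments(metrics: dict[str, dict], key: str, higher_is_better: bool = True):
    """Return (name, score) pairs for `key`, best first.

    Experiments whose metric file lacks `key` are skipped.
    """
    scored = [(name, values[key]) for name, values in metrics.items() if key in values]
    return sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)
```

Passing higher_is_better=False covers scores like a loss, where smaller is better.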
I hope you liked my short story with DVC :).