Use DVC to version control ML/DL models

This story shows the method I use to version control the models and data of my DL projects.

Published in NLP-trend-and-review-en

4 min read · Feb 7, 2020

Introduction

Git is a well-known version control tool, widely used in programming projects. ML/DL projects, however, usually involve a large amount of data, often measured in MB, GB, or even TB, which is too much for Git to handle well. DVC offers an alternative for this task. You can think of it like a pointer in programming: instead of storing the real data in the repository, we store a small pointer file that refers to data kept in remote storage (e.g. S3).
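To make the pointer idea concrete, here is a minimal sketch in Python. It is not DVC's actual implementation: the function names `add` and `pull`, the `.pointer.json` suffix, and the local `store` directory standing in for remote storage are all made up for illustration. The only part borrowed from DVC is the idea of identifying data by a hash of its content.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add(data_path: Path, store: Path) -> Path:
    """Copy data into a content-addressed store and write a small pointer file."""
    md5 = file_md5(data_path)
    target = store / md5[:2] / md5[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data_path.read_bytes())
    # Hypothetical pointer format: just the hash and the original file name
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": md5, "path": data_path.name}))
    return pointer


def pull(pointer: Path, store: Path) -> Path:
    """Restore the real data next to the pointer file by looking up its hash."""
    meta = json.loads(pointer.read_text())
    source = store / meta["md5"][:2] / meta["md5"][2:]
    restored = pointer.with_name(meta["path"])
    restored.write_bytes(source.read_bytes())
    return restored


with tempfile.TemporaryDirectory() as tmp:
    workspace, store = Path(tmp) / "workspace", Path(tmp) / "store"
    workspace.mkdir()
    data = workspace / "train.json"
    data.write_text('{"text": "hello"}')

    pointer = add(data, store)  # the tiny pointer is what git would track
    data.unlink()               # simulate a fresh clone without the data
    restored = pull(pointer, store)
    roundtrip_ok = restored.read_text() == '{"text": "hello"}'
    pointer_size = pointer.stat().st_size
```

The pointer file stays a few dozen bytes no matter how large the data is, which is why it is cheap to commit alongside the code.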

Useful Commands

DVC offers a thorough tutorial on its website, so here I only cover the basic situations in which I use it.

# To get started
dvc init

# Set the default remote
dvc remote add -d <dvc-name> <remote-dvc-path>

# Add a data file
dvc add <data>

# Push to the remote
dvc push

# When we have only pulled the .dvc files, download the actual files
dvc pull

Imagine that we want to create a new project using DVC:

# Add the data we want to store remotely
dvc add data

# Push the data to remote storage (just like git push), then commit the .dvc file
dvc push
git add data.dvc
git commit
git push

Imagine that we have just pulled down a project that uses DVC, trained a model, and now want to push it back:

git pull

# Attention: remove all local data files first
# Pull down all the data we need from the remote, according to the .dvc files
dvc pull

# <run models, and generate a big model file>
dvc add model.pkl
dvc push
git add model.pkl.dvc
git commit
git push

Integrated with Git

Imagine we have just released version 1.3.0 and want to train some models or run some experiments. Thinking a little bit ahead, after several experiments we will very likely compare them and choose the most successful one as our official version. The point here is therefore how to keep the data and guarantee the reproducibility of each experiment.

We keep a branch for this task, and each commit represents one experiment. This way, to reproduce a particular result we just check out the corresponding commit and then use DVC to fetch the training data or trained models it needs.
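The checkout-then-pull routine can be wrapped in a small helper. This is only a sketch: `restore_commands` and `restore_experiment` are hypothetical names, the commit hash is made up, and by default the helper only lists the commands (a dry run) instead of executing them.

```python
import shlex
import subprocess
from typing import List


def restore_commands(commit: str) -> List[List[str]]:
    """Commands that bring code and data back to the state of one experiment."""
    return [
        ["git", "checkout", commit],  # code and .dvc pointer files
        ["dvc", "pull"],              # data and models the pointers refer to
    ]


def restore_experiment(commit: str, dry_run: bool = True) -> List[str]:
    """List (and optionally execute) the restore commands for a commit."""
    planned = []
    for cmd in restore_commands(commit):
        planned.append(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing command
    return planned


# "3f2a1bc" is a made-up commit hash standing for one experiment
plan = restore_experiment("3f2a1bc")
print("\n".join(plan))
```

Passing `dry_run=False` would actually run the commands, which assumes both git and dvc are installed and the working directory is the project root.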

Branches can be used for different purposes

Sometimes the code contains many models for different uses. Take the transformers package, a Python package providing BERT-based models: it can be used for NER, classification, etc. To run experiments, we could branch out for each use. In other cases, for companies doing business with many customers, each branch could represent an individual customer, which makes the models and data easier to manage.

Pipeline

To elaborate, I distinguish commits by their pipelines. In our training process we might have different ML pipelines, for example using different data, models, or evaluation methods. DVC helps us define them clearly, and then we just need to run dvc repro to reproduce a result.

DVC lets us define a pipeline with the dvc run command. For example, we define a very simple pipeline that first preprocesses the data and then trains a model. We can put it all in a pipeline.sh file.

# pipeline.sh

# Preprocess
dvc run -f preprocess.dvc \
        -d src/preprocess.py -d data/train.json \
        -o models/dataset.pkl \
        "python3 src/preprocess.py --data data/train.json --dataset models/dataset.pkl --char"

# Train
dvc run -f train.dvc \
        -d src/train.py -d models/dataset.pkl -d data/config.json \
        -o models/model.pkl \
        "python3 src/train.py --dataset models/dataset.pkl --config data/config.json --model models/model.pkl --char"

Each dvc run represents a stage in the pipeline and generates a .dvc file after execution. In our example, we get preprocess.dvc and train.dvc respectively. Finally, we can use dvc pipeline show --ascii train.dvc to visualize the pipeline.
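To see why declaring dependencies (-d) and outputs (-o) is enough to link stages together, here is a conceptual sketch in Python. It is not how DVC is implemented: the `STAGES` dictionary simply copies the file names from pipeline.sh, and `stage_graph`, `execution_order`, and `stages_to_rerun` are hypothetical helpers. A stage depends on another stage whenever one of its dependencies is that stage's output.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stage name -> (dependencies, outputs), copied from pipeline.sh
STAGES = {
    "preprocess.dvc": (
        ["src/preprocess.py", "data/train.json"],
        ["models/dataset.pkl"],
    ),
    "train.dvc": (
        ["src/train.py", "models/dataset.pkl", "data/config.json"],
        ["models/model.pkl"],
    ),
}


def stage_graph(stages):
    """Map each stage to the stages that produce one of its dependencies."""
    producer = {out: name for name, (_, outs) in stages.items() for out in outs}
    return {
        name: {producer[dep] for dep in deps if dep in producer}
        for name, (deps, _) in stages.items()
    }


def execution_order(stages):
    """Order stages so every producer runs before its consumers."""
    return list(TopologicalSorter(stage_graph(stages)).static_order())


def stages_to_rerun(stages, changed_file):
    """Stages affected, directly or downstream, by a change to one file."""
    dirty_files, affected = {changed_file}, []
    for name in execution_order(stages):
        deps, outs = stages[name]
        if dirty_files & set(deps):
            affected.append(name)
            dirty_files |= set(outs)  # outputs of a rerun stage are dirty too
    return affected


print(execution_order(STAGES))
print(stages_to_rerun(STAGES, "data/train.json"))
print(stages_to_rerun(STAGES, "data/config.json"))
```

In this sketch, changing data/train.json marks both stages for a rerun, while changing data/config.json only affects training, because preprocessing never reads the config.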

With the --outs option, we can see all the data files managed by DVC and how they flow through the pipeline.

dvc pipeline show --ascii train.dvc --outs

Conclusion

DVC can be used in many ways; here I have shared my personal use cases. To conclude, DVC gives us a better way to manage data and makes reproducing models easier.