Use DVC to version control ML/DL models
This story shows the method I use to version control the models and data in my DL projects.
Git is well known as a very useful version control tool that is widely used in programming projects. ML/DL projects, however, always involve tons of data, which is too large (often measured in MB, GB, or even TB) to track with git. DVC offers a substitute for this task. You can think of it as the pointer concept in programming: we don’t store the real data directly; instead, we keep a pointer that points to some remote storage (e.g. S3).
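To make the pointer idea concrete, here is a minimal sketch in plain shell (no DVC required) of what DVC does conceptually: hash the file, copy the real bytes into a storage directory, and keep only a tiny pointer file for git. The file names (data.csv, remote-storage) are made up for illustration, and the pointer format is simplified compared to a real .dvc file.

```shell
# Conceptual sketch of DVC's pointer mechanism (not real DVC commands)
mkdir -p remote-storage
echo "big training data" > data.csv

# Hash the content, like DVC computes an md5 for each tracked file
md5=$(md5sum data.csv | cut -d' ' -f1)

# "Push": store the real bytes under their hash in the storage directory
cp data.csv "remote-storage/$md5"

# Keep only a tiny pointer file; this is what git tracks
echo "md5: $md5" > data.csv.dvc
cat data.csv.dvc
```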
DVC offers a very complete tutorial on its website; here I just mention some basic situations in which I use it.
# To get started
dvc init
# Set up a remote
dvc remote add -d <dvc-name> <remote-dvc-path>
# Add a data file
dvc add <data>
# Push to remote
dvc push
# When we have just pulled the .dvc files, draw all data files down
dvc pull
Imagine the situation where we want to create a new project using the dvc mechanism
# Add the data we want to store remotely
dvc add data
# Confirm it and push it to the remote registry, just like git push
dvc push
git add data.dvc
git commit
git push
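As a side note, dvc add writes the pointer file and also adds the data itself to .gitignore, so git only ever sees the small .dvc file; the exact pointer contents vary by DVC version:

```shell
# After `dvc add data`:
cat data.dvc      # a small text file holding the md5 hash and path of `data`
cat .gitignore    # now lists `data`, so git tracks only data.dvc
```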
Imagine that we have just pulled a project down using dvc, generated a model, and now want to push it back
git pull
# Attention: remove all local data files first
# Pull down all the data we need according to the .dvc files from remote
dvc pull
# <run models and generate a big model file>
dvc add model.pkl
git add model.pkl.dvc
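Presumably the flow ends the same way as in the previous case: push the model bytes to the DVC remote and record the new pointer file in git. A sketch (the commit message is illustrative):

```shell
# Push the new model data to the DVC remote
dvc push
# Record the pointer file in git history
git commit -m "Add trained model"
git push
```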
Integrated with Git
Imagine we released the latest version 1.3.0, and we want to train some models or run some experiments. Let’s think a little bit ahead: after we run several experiments, it’s very likely we will compare them and choose the most successful one as our formal version. Therefore, the point here is how to keep the data and guarantee the reproducibility of each experiment.
We keep a branch for this task, where each commit represents one experiment. Following this manner, if we want to reproduce a particular result, we just check out the corresponding commit and then use dvc to get the needed training data or trained models.
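Reproducing one experiment then looks roughly like this (<experiment-commit> is a placeholder for the commit hash of the experiment you want back):

```shell
# Go back to the exact code version of that experiment
git checkout <experiment-commit>
# Restore the matching data/model files recorded in the .dvc files
dvc checkout    # or `dvc pull` if the data is not in the local cache
```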
Sometimes the code might be composed of lots of models for different usages. Take the transformers package, a Python package providing BERT-based models: it can be used for NER, classification, etc. To experiment, we could branch out for each usage. In other cases, for companies doing business with many customers, each branch could also represent an individual customer, giving better management of models and data.
To elaborate more, I define each commit by varying the pipeline. In our training process, we might have different ML pipelines, for example using different data, models, or evaluation methods. DVC helps us define this clearly, and then we just need to run dvc repro to reproduce it.
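For a stage-based pipeline like the one defined in the next section, dvc repro can be pointed at a stage’s .dvc file and reruns only what changed (this assumes the pre-1.0 DVC syntax, where each stage lives in its own .dvc file):

```shell
# Rerun the train stage and any upstream stages whose dependencies changed
dvc repro train.dvc
```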
Actually, dvc allows us to define a pipeline here using the dvc run command. For example, we can define a very simple pipeline that first preprocesses the data and then trains a model, and put it all in a pipeline.sh file.
dvc run -f preprocess.dvc \
-d src/preprocess.py -d data/train.json \
-o models/dataset.pkl \
"python3 src/preprocess.py --data data/train.json --dataset models/dataset.pkl --char"
# Train
dvc run -f train.dvc \
-d src/train.py -d models/dataset.pkl -d data/config.json \
-o models/model.pkl \
"python3 src/train.py --dataset models/dataset.pkl --config data/config.json --model models/model.pkl --char"
dvc run actually represents a stage in the pipeline and generates a .dvc file after execution. In our example, we will generate preprocess.dvc and train.dvc respectively. Finally, we can use dvc pipeline show --ascii train.dvc to visualize it. With the --outs parameter, we can also see all the data managed by the .dvc files and how they flow through the pipeline.
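Putting the two visualization commands together (again assuming the pre-1.0 dvc pipeline show syntax):

```shell
# Show the stage graph as ASCII art
dvc pipeline show --ascii train.dvc
# Show the graph in terms of output files instead of stage files
dvc pipeline show --ascii --outs train.dvc
```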
DVC can be used in many ways; here I have shared my personal use cases. To conclude, dvc gives us a better way to manage data and makes reproducing models easier.