Use DVC to version control ML/DL models

This story shows how I version-control the models and data of my DL projects

JOJO
Feb 7 · 4 min read

Introduction

Git is a well-known version control tool that is widely used in programming projects. ML/DL projects, however, always involve large amounts of data, often measured in MB, GB, or even TB, which is too much for git to handle well. DVC offers an alternative for this task. You can think of it like the pointer concept in programming: we don’t store the real data in the repository; instead, we store a pointer that refers to some remote storage (e.g. S3).
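
To make the pointer idea concrete, here is a toy sketch in plain shell. It is not how DVC is implemented: the remote directory, the .pointer file name, and the use of cksum are all assumptions made for illustration (a real .dvc file records an MD5 hash of the data).

```shell
#!/bin/sh
# Toy illustration of the "pointer" idea; this is NOT DVC's implementation.
set -eu

workdir=$(mktemp -d)        # scratch space standing in for a project
remote="$workdir/remote"    # hypothetical stand-in for remote storage such as S3
mkdir -p "$remote"

# A "large" data file that we do not want to commit to git.
printf 'pretend this is several GB of training data\n' > "$workdir/data.bin"

# Store the file in the "remote" under its checksum and keep only a tiny pointer.
checksum=$(cksum < "$workdir/data.bin" | cut -d ' ' -f 1)
cp "$workdir/data.bin" "$remote/$checksum"
printf '%s\n' "$checksum" > "$workdir/data.bin.pointer"   # small file, safe to commit
rm "$workdir/data.bin"

# Later, or on another machine: follow the pointer to restore the data.
restored_key=$(cat "$workdir/data.bin.pointer")
cp "$remote/$restored_key" "$workdir/data.bin"
```

Only the small pointer file would go into git; the data itself lives in the storage the pointer refers to.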

Useful Commands

DVC offers a very complete tutorial on its website; here I only cover the basic situations in which I use it.

# To get started
dvc init
# Setting remote
dvc remote add -d <dvc-name> <remote-dvc-path>
# Track a data file or directory
dvc add <data>
# Push to remote
dvc push
# After pulling the .dvc files with git, download the actual data
dvc pull

Imagine that we want to create a new project using DVC:

# Add data we want to store remotely
dvc add data
# Push the data to the remote storage, just like git push
dvc push
# Commit the small .dvc pointer file with git
git add data.dvc
git commit
git push

Imagine that we have just pulled a project that uses DVC, generated a model, and want to push it back:

git pull
# Attention: remove all local data files first
# Pull all the data we need from the remote, according to the .dvc files
dvc pull
# <run models and generate a big model file>
dvc add model.pkl
dvc push
git add model.pkl.dvc
git commit
git push

Integrated with Git

Imagine we have just released version 1.3.0 and want to train some models or run some experiments. Let’s think a little bit ahead: after several experiments, we will very likely compare them and choose the most successful one as our official version. The key point, therefore, is how to keep the data and guarantee the reproducibility of each experiment.

We keep a dedicated branch for this task, where each commit represents one experiment. This way, if we want to reproduce a particular result, we just check out the corresponding commit and then use dvc to fetch the training data or trained models it needs.
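
Restoring one experiment could look roughly like the sketch below. restore_experiment, run, and DRY_RUN are hypothetical helpers, not part of git or DVC; only git checkout and dvc pull are real commands.

```shell
#!/bin/sh
# Sketch: restore the code and the data of one experiment commit.

run() {
    # Print the command instead of executing it when DRY_RUN=1.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restore_experiment() {
    commit=$1
    # Step 1: get the code and the .dvc pointer files of that experiment.
    # Step 2: download the data and models those pointers refer to.
    run git checkout "$commit" && run dvc pull
}
```

For example, DRY_RUN=1 restore_experiment <commit> prints the two commands without touching the repository, which is handy for checking what would happen first.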

Sometimes the code contains many models for different uses. Take the transformers package, a Python package providing BERT-based models: it can be used for NER, classification, and so on. To run experiments, we could create a branch for each usage. In other cases, for companies that serve many customers, each branch could represent an individual customer, which makes the models and data easier to manage.
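
One possible way to cut such branches from a release is sketched below. The exp/ prefix, the v1.3.0 tag name, and the helper functions are assumptions for illustration; the only real command is git checkout -b with a start point.

```shell
#!/bin/sh
# Sketch: one experiment branch per usage or customer, cut from a release.

run() {
    # Print the command instead of executing it when DRY_RUN=1.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_experiment_branch() {
    name=$1   # hypothetical usage or customer name, e.g. "ner"
    base=$2   # hypothetical release tag to branch from, e.g. "v1.3.0"
    run git checkout -b "exp/$name" "$base"
}
```

Calling start_experiment_branch ner v1.3.0 would create and switch to exp/ner, where each later commit records one NER experiment.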

Pipeline

To elaborate, I distinguish commits by their pipelines. In our training process we might have different ML pipelines, for example ones using different data, models, or evaluation methods. DVC helps us define each pipeline clearly, and then we just need to run dvc repro to reproduce it.

DVC lets us define a pipeline using the dvc run command. For example, we define a very simple pipeline that first preprocesses the data and then trains a model. We can put it all in a pipeline.sh file.

# pipeline.sh
# Preprocess
dvc run -f preprocess.dvc \
-d src/preprocess.py -d data/train.json \
-o models/dataset.pkl \
"python3 src/preprocess.py --data data/train.json --dataset models/dataset.pkl --char"
# Train
dvc run -f train.dvc \
-d src/train.py -d models/dataset.pkl -d data/config.json \
-o models/model.pkl \
"python3 src/train.py --dataset models/dataset.pkl --config data/config.json --model models/model.pkl --char"

Each dvc run represents one stage in the pipeline and generates a .dvc file after execution. In our example, we get preprocess.dvc and train.dvc respectively. Finally, we can use dvc pipeline show --ascii train.dvc to visualize the pipeline.

With the --outs parameter, we can see all the data managed by the .dvc files and how it flows through the pipeline.
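
The reason stages with declared dependencies are cheap to reproduce is that a stage only needs to run again when one of its dependencies has changed. The toy sketch below imitates that idea with a checksum stored in a stamp file. It is not DVC's implementation: run_stage, the stamp file, and cksum are assumptions for illustration, and it tracks a single dependency only.

```shell
#!/bin/sh
# Toy sketch of checksum-based stage skipping; NOT DVC's implementation.
set -eu

# run_stage <stamp-file> <dependency-file> <command...>
# Re-runs the command only when the dependency's checksum differs from the
# one recorded in the stamp file (a hypothetical stand-in for a .dvc file).
run_stage() {
    stamp=$1
    dep=$2
    shift 2
    current=$(cksum < "$dep")
    if [ -f "$stamp" ] && [ "$(cat "$stamp")" = "$current" ]; then
        echo "skipped"
    else
        "$@"                                  # run the stage command
        printf '%s\n' "$current" > "$stamp"   # remember what we ran against
        echo "ran"
    fi
}
```

For instance, run_stage train.stamp models/dataset.pkl python3 src/train.py ... would retrain only after models/dataset.pkl has changed.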

Conclusion

DVC can be used in many ways; here I have shared my own use cases. To conclude, DVC gives us a better way to manage data and makes reproducing models easier.

Reference

  1. DVC

NLP-trend-and-review-en

Review classic algorithms and latest paper

Written by

JOJO

Hello! Here I mainly share NLP concepts and notes from reading papers. Because of Medium’s formatting issues, I recommend reading on my GitHub Page instead: https://tsupei.github.io
