Use DVC to version control ML/DL models

This story shows how I version-control the models and data in my DL projects

CHEN TSU PEI
NLP-trend-and-review-en
4 min read, Feb 7, 2020

Introduction

Git is a well-known version control tool that is widely used in programming projects. ML/DL projects, however, usually involve large amounts of data (often measured in MB, GB, or even TB), which is too much to store in Git. DVC offers an alternative for this task. You can think of it like a pointer in programming: we don't store the real data in the repository; instead, we keep a small pointer that refers to the data in some remote storage (e.g. S3).
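To make the pointer idea concrete, here is a toy sketch in plain shell. It is not how DVC is implemented; the function name track_file, the .toy_cache directory, and the .ptr suffix are all made up for illustration, and it assumes md5sum from GNU coreutils is available.

```shell
# Toy illustration of the "pointer" idea; NOT DVC's actual implementation.
# track_file copies a file into a local cache keyed by its MD5 hash and
# writes a small pointer file that Git could version instead of the data.
track_file() {
    file=$1
    cache_dir=${2:-.toy_cache}                 # hypothetical cache location
    hash=$(md5sum "$file" | cut -d ' ' -f 1)   # content hash identifies the data
    mkdir -p "$cache_dir"
    cp "$file" "$cache_dir/$hash"              # the real bytes live in the cache
    # The pointer file is a few bytes, no matter how big the data is.
    printf 'md5: %s\npath: %s\n' "$hash" "$file" > "$file.ptr"
}
```

Calling track_file data.txt would leave a data.txt.ptr file next to the data. Only that small file would need to go into Git, while the cache could be synced to remote storage separately.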

Useful Commands

DVC offers a very complete tutorial on its website, so here I only cover the basic situations in which I use it.

Imagine that we want to create a new project that uses DVC
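A minimal sketch of that setup could look like the following. The function name start_dvc_project, the remote name storage, the default path data/train.csv, and the S3 URL are placeholders of mine, not values from a real project. I wrap the commands in a function only so the snippet can be sourced without running anything.

```shell
# Sketch: start tracking a dataset with DVC inside a new Git repository.
# The default path and bucket URL below are placeholders.
start_dvc_project() {
    data_path=${1:-data/train.csv}
    remote_url=${2:-s3://my-bucket/dvc-store}

    git init
    dvc init                                  # creates the .dvc/ directory
    dvc remote add -d storage "$remote_url"   # -d makes it the default remote
    dvc add "$data_path"                      # writes the pointer file $data_path.dvc
    # Git versions only the config, the pointer file, and the .gitignore
    # that DVC writes next to the data so the raw file stays out of Git.
    git add .dvc/config "$data_path.dvc" "$(dirname "$data_path")/.gitignore"
    git commit -m "Track training data with DVC"
    dvc push                                  # upload the real data to the remote
}
```

Usage would be something like start_dvc_project data/train.csv s3://my-bucket/dvc-store, run from the project root.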

Imagine that we have just pulled down a project that uses DVC, we generate a model, and then we push it back
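This round trip can be sketched as below. The names are assumptions for illustration: train.py stands for whatever script produces the model, models/model.pkl for its output, and project for the clone directory.

```shell
# Sketch: fetch a DVC project, train a model, and publish it.
# train.py and models/model.pkl are hypothetical names.
train_and_share_model() {
    repo_url=$1
    model_path=${2:-models/model.pkl}

    git clone "$repo_url" project || return 1
    cd project || return 1
    dvc pull                    # download the data the .dvc files point to
    python train.py             # hypothetical script that writes $model_path
    dvc add "$model_path"       # creates or updates $model_path.dvc
    git add "$model_path.dvc" "$(dirname "$model_path")/.gitignore"
    git commit -m "Add trained model"
    dvc push                    # upload the model itself to remote storage
    git push                    # share the pointer file with collaborators
}
```

The order of the last two steps is a matter of taste, but pushing the data before the pointer means nobody can pull a commit whose model is not yet in remote storage.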

Integrated with Git

Imagine we have just released version 1.3.0 and now want to train some models or run some experiments. Let's think a little bit ahead: after several experiments, we will very likely compare them and choose the most successful one as our official version. The key question is therefore how to keep the data and guarantee the reproducibility of each experiment.

A new branch for DVC

We keep a dedicated branch for this task, and each commit represents one experiment. This way, if we want to reproduce a particular result, we just check out the corresponding commit and then use DVC to fetch the training data or trained models it needs.
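Restoring one experiment can be sketched in two commands. The function name restore_experiment is my own, and the commit argument is whatever hash or tag marks the experiment.

```shell
# Sketch: restore the code, data, and models of one experiment.
restore_experiment() {
    commit=$1
    # Git restores the code and the .dvc pointer files of that experiment.
    git checkout "$commit" || return 1
    # DVC then downloads the data and models those pointer files refer to
    # and places them in the workspace.
    dvc pull
}
```

If the data is already in the local DVC cache, dvc checkout alone should be enough for the second step, since nothing has to be downloaded.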

Branches can be used for different purposes

Sometimes the code supports many models for different uses. Take the transformers package, a Python package that provides BERT-based models: it can be used for NER, classification, and so on. To run experiments, we could create a branch for each use. In other cases, such as a company serving many customers, each branch could represent an individual customer, which makes the models and data easier to manage.

Pipeline

To elaborate, I distinguish commits by their pipelines. In our training process we might have different ML pipelines, for example ones using different data, models, or evaluation methods. DVC helps us define each one clearly, and then we just need to run dvc repro to reproduce it.

DVC lets us define a pipeline using the dvc run command. For example, we define a very simple pipeline that first preprocesses the data and then trains a model. We could use a pipeline.sh file to define it all.
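A pipeline.sh along these lines might look like the sketch below. It uses the DVC 0.x form of dvc run that this post is based on; as far as I know, later DVC releases changed how stages are declared, so check the documentation for your version. The script names and data paths (preprocess.py, train.py, data/raw.csv, data/clean.csv, models/model.pkl) are hypothetical, and the function wrapper is only there so the file can be sourced safely.

```shell
# pipeline.sh (sketch, DVC 0.x syntax; all file names are hypothetical)
define_pipeline() {
    # Stage 1: preprocessing. -d lists dependencies, -o lists outputs,
    # and -f names the stage file that DVC writes (preprocess.dvc).
    dvc run -f preprocess.dvc \
        -d preprocess.py -d data/raw.csv \
        -o data/clean.csv \
        python preprocess.py

    # Stage 2: training. It depends on the output of stage 1, which is
    # how DVC knows the order of the stages.
    dvc run -f train.dvc \
        -d train.py -d data/clean.csv \
        -o models/model.pkl \
        python train.py
}
```

After sourcing the file and calling define_pipeline once, running dvc repro train.dvc should re-execute only the stages whose dependencies have changed.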

Each dvc run represents a stage in the pipeline and generates a .dvc file after execution. In our example, we generate preprocess.dvc and train.dvc respectively. Finally, we can use dvc pipeline show --ascii train.dvc to visualize it.

dvc pipeline show --ascii train.dvc

With the --outs option, we can see all the data files managed by the .dvc files and how they flow through the pipeline.

dvc pipeline show --ascii train.dvc --outs

Conclusion

DVC can be used in many ways; here I have shared my personal use cases. To conclude, DVC provides a better way to manage data and makes reproducing models easier.

Reference

  1. DVC

This publication is no longer updated! Please head over to https://tsupei.github.io, where more NLP articles continue to be posted.