Data Version Control Tutorial

Dmitry Petrov
Data Version Control
15 min read · Mar 20, 2018

UPDATE 3/4/2019: Code examples in this tutorial are outdated. Please use the updated tutorial https://dvc.org/doc/tutorial

Today the data science community still lacks good practices for organizing projects and collaborating effectively. ML algorithms and methods are no longer simple "tribal knowledge", but they are still difficult to implement, manage and reuse.

One of the biggest challenges in reusing, and hence managing, ML projects is reproducibility.

To address reproducibility we have built Data Version Control, or DVC.

This example shows you how to solve a text classification problem using the DVC tool.

Git branches should beautifully reflect the non-linear structure common to the ML process, where each hypothesis can be represented as a Git branch. However, the inability to store data in a repository and the discrepancy between code and data make it extremely difficult to manage a data science project with Git.

DVC streamlines large data files and binary models into a single Git environment, and this approach does not require storing binary files in your Git repository.

1. Preparation

1.1. What are we going to do?

In this document we will build an ML classification model which classifies Stack Overflow questions into two classes: those with a "python" tag and those without it. For training purposes a small subset of the data will be used — only about 180MB of XML files.

Most of the code for the problem is ready and will be downloaded in the first steps. Later we will modify the code a bit to improve the model.

1.2. Getting the sample code

Take the following steps to initialize a new Git repository and get the sample code into it:

$ mkdir classify
$ cd classify
$ git init
$ mkdir code
$ S3_DIR=https://s3-us-west-2.amazonaws.com/dvc-share/so
$ wget -nv -P code/ \
$S3_DIR/code/featurization.py \
$S3_DIR/code/evaluate.py \
$S3_DIR/code/train_model.py \
$S3_DIR/code/split_train_test.py \
$S3_DIR/code/xml_to_tsv.py \
$S3_DIR/code/conf.py \
$S3_DIR/code/requirements.txt
$ git add code/
$ git commit -m 'Download code'

Install the code requirements:

$ pip install -r code/requirements.txt

1.3. Install DVC

Now the DVC software should be installed. The best way to install DVC is a system-dependent package. DVC supports all common operating systems: Mac OS X, Linux and Windows. You can find the latest version of the package on the DVC website: dataversioncontrol.com

Alternatively, you can install DVC with the Python package manager, pip:

$ pip install dvc

1.4. Initialize

DVC works on top of Git repositories. You run DVC initialization in a repository directory to create DVC metafiles and directories.

After DVC initialization, a new directory .dvc will be created with the config and .gitignore files and a cache directory. In general these files and directories are hidden from the user, who does not interact with them directly. However, we describe the DVC internals a bit for a better understanding of how it works.

$ dvc init
$ ls -a .dvc
./ ../ .gitignore cache/ config
$ git status -s
A .dvc/.gitignore
A .dvc/config
$ cat .dvc/.gitignore
cache
state
lock
$ git commit -m 'Init DVC'

The .dvc/cache directory is one of the most important parts of any DVC repository. It contains the actual content of all data files and will be described in the next chapter in more detail. The most important thing about this directory is that the .dvc/.gitignore file lists it, which means the cache directory is not under Git control — it is your local directory and you cannot push it to any Git remote.
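You can double-check that the cache is ignored using a standard Git command:

$ git check-ignore .dvc/cache
.dvc/cache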

2. Define ML pipeline

2.1. Get data file

To include a data file in your data science environment you need to copy the file into one of the repository directories. We create a special data directory for the data files and download a 40MB source data file into this directory.

$ mkdir data
$ wget -P data $S3_DIR/100K/Posts.xml.tgz
$ du -sh data/*
40M data/Posts.xml.tgz

This data/Posts.xml.tgz is still just a regular file. Now it is time to move it under DVC control with the $ dvc add command. After the command executes you will see a new file data/Posts.xml.tgz.dvc and a change in data/.gitignore. Both of these files have to be committed to the repository.

$ dvc add data/Posts.xml.tgz
$ du -sh data/*
40M data/Posts.xml.tgz
4.0K data/Posts.xml.tgz.dvc
$ git status -s data/
?? data/.gitignore
?? data/Posts.xml.tgz.dvc
$ git add .
$ git commit -m 'Add source dataset'

You have probably already noticed that the actual file was not committed to the repository. This happened because DVC added the file to data/.gitignore, so Git ignores this data file from now on.

Excluding large data files from the Git repository by adding them to .gitignore is general DVC behavior.
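You can see this for yourself — the archive is now listed in data/.gitignore:

$ cat data/.gitignore
Posts.xml.tgz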

2.2. Data file internals

If you take a look at the DVC-file you will see that only outputs are defined in outs. In this file only one output is defined. The output contains the data file path in the repository and an md5 checksum. This checksum determines the location of the actual content file in the DVC cache directory .dvc/cache.

The output section of a DVC-file thus defines the relationship between the data file path in a repository and the path in the cache directory.

$ cat data/Posts.xml.tgz.dvc
cmd: null
deps: []
outs:
- cache: true
md5: 5988519f8465218abb23ce0e0e8b1384
path: Posts.xml.tgz
$ du -sh .dvc/cache/59/*
40M .dvc/cache/59/88519f8465218abb23ce0e0e8b1384

Keeping the actual file content in a cache directory and placing a copy in the user workspace during $ git checkout is a regular trick used by Git and Git-LFS (Git Large File Storage). This trick works fine for tracking small source code files. For large data files it might not be the best approach, because a checkout operation for a 10GB data file might take many seconds, and a 50GB file checkout (think copy) might take a couple of minutes.

DVC was designed with large data files in mind. This means gigabytes or even hundreds of gigabytes in file size. Instead of copying files from the cache to the workspace, DVC creates hardlinks.

This is pretty similar to what Git-annex does. Creating a file hardlink is a quick operation, so with DVC you can easily check out a few dozen files of any size. A hardlink also does not require twice as much space on the hard drive: even though each of the files contains 40MB of data, the overall size of the repository is still 40MB, and both of the files correspond to the same inode (the actual file content) in the file system. Use $ ls -i to see file system inodes:

$ ls -i data/Posts.xml.tgz
78483929 data/Posts.xml.tgz
$ ls -i .dvc/cache/59/
78483929 88519f8465218abb23ce0e0e8b1384
$ du -sh .
41M .

Note, DVC uses hardlinks on all the supported operating systems, including Mac OS, Linux and Windows. Some details (like inodes) might differ, but the overall DVC behavior is the same.

2.3. Running commands

Once data source files are in the workspace you can start processing the data and training ML models out of the data files. DVC helps you define the steps of your ML process and pipe them together into an ML pipeline.

The command $ dvc run executes any command that you pass to it as a list of parameters. However, a command alone is not as interesting as a command in a pipeline. Commands are piped together through their dependencies and output files. Dependencies and outputs can include input files, input directories, and source code files or directories.

  1. Option -d file.tsv should be used to specify a dependency file or directory. The dependency can be a regular file from the repository or a data file.
  2. -O file.tsv (big O) specifies a regular output file.
  3. -o file.tsv (small o) specifies an output data file, which means DVC will transform this file into a data file (think: it will run $ dvc add file.tsv).

It is important to specify the dependencies and the outputs of the run command before the command to run.
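For example, a hypothetical throwaway step could record the archive size in a regular (non-cached) report file with big -O; the report file name here is made up for illustration, a sketch rather than a step of our pipeline:

$ dvc run -d data/Posts.xml.tgz -O data/report.txt \
bash -c 'du -h data/Posts.xml.tgz > data/report.txt'

Because the report was declared with -O, DVC tracks it as a pipeline output but does not move its content into the cache or add it to .gitignore.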

Let's see how the unarchiving command $ tar works under DVC:

$ dvc run -d data/Posts.xml.tgz -o data/Posts.xml \
tar zxf data/Posts.xml.tgz -C data/

Using 'Posts.xml.dvc' as a stage file
Reproducing 'Posts.xml.dvc':
tar zxf data/Posts.xml.tgz -C data/
$ du -sh data/*
145M data/Posts.xml
40M data/Posts.xml.tgz
4.0K data/Posts.xml.tgz.dvc

Option -C specifies the output directory for the tar command. "-d data/Posts.xml.tgz" defines the input file and "-o data/Posts.xml" the output data file.

DVC runs the command and does some additional work if the command was successful:

  1. The command unarchives the data file data/Posts.xml.tgz into a regular file data/Posts.xml. The command itself knows nothing about data files or DVC.
  2. DVC transforms all the "-o" outputs into data files. It is like applying $ dvc add file1 to each of the outputs. As a result, all the actual data file content goes to the cache directory .dvc/cache and each of the filenames is added to .gitignore.
  3. For reproducibility purposes, DVC creates the DVC-file Posts.xml.dvc — a file with meta-information about the command. DVC derives the DVC-file name from the first output file name by adding the .dvc suffix (this can be changed with the "-f" option).
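For instance, the tar step above could have been given an explicit stage file name — a hypothetical variant, not a command you need to run again:

$ dvc run -f unarchive.dvc \
-d data/Posts.xml.tgz -o data/Posts.xml \
tar zxf data/Posts.xml.tgz -C data/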

DVC-file example:

$ cat Posts.xml.dvc
cmd: tar zxf data/Posts.xml.tgz -C data/
deps:
- md5: 5988519f8465218abb23ce0e0e8b1384
path: data/Posts.xml.tgz
outs:
- cache: true
md5: cfdaa4bba57fa07d81ff96685a9aab2c
path: data/Posts.xml

  • cmd — the command to run.
  • deps — dependencies, with md5 checksums.
  • outs — outputs, with md5 checksums.

As previously with the $ dvc add command, the data/.gitignore file was modified. Now it includes the unarchive command's output file Posts.xml.

$ git status -s
M data/.gitignore
?? Posts.xml.dvc
$ cat data/.gitignore
Posts.xml.tgz
Posts.xml

It is important that the output Posts.xml file was transformed by DVC into a data file, in accordance with the -o option.

You can find the corresponding cache file by the checksum, which starts with cfdaa4b according to the DVC-file Posts.xml.dvc:

$ ls .dvc/cache/
59/ cf/
$ du -sh .dvc/cache/59/* .dvc/cache/cf/*
40M .dvc/cache/59/88519f8465218abb23ce0e0e8b1384
145M .dvc/cache/cf/daa4bba57fa07d81ff96685a9aab2c
$ du -sh .
186M .
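
Since the -o output was hardlinked from the cache, you can repeat the inode check from section 2.2 — both paths report the same inode number (the checksum comes from the DVC-file above):

$ ls -i data/Posts.xml .dvc/cache/cf/daa4bba57fa07d81ff96685a9aab2c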

Let’s commit the result of the unarchive command. This is the first step of our ML pipeline.

$ git add .
$ git commit -m Unarchive

2.4. Running in bulk

A single step of our ML pipeline was defined and committed into the repository. It is not necessary to commit each step right after its definition — you can run a few steps and commit them later.

Let’s run the next step, converting an XML file to TSV, and the following step, separating the training and testing datasets, one by one:

$ dvc run -d data/Posts.xml -d code/xml_to_tsv.py -d code/conf.py \
-o data/Posts.tsv \
python code/xml_to_tsv.py

Using 'Posts.tsv.dvc' as a stage file
Reproducing 'Posts.tsv.dvc':
python code/xml_to_tsv.py
$ dvc run -d data/Posts.tsv -d code/split_train_test.py \
-d code/conf.py \
-o data/Posts-test.tsv -o data/Posts-train.tsv \
python code/split_train_test.py 0.33 20180319

Using 'Posts-test.tsv.dvc' as a stage file
Reproducing 'Posts-test.tsv.dvc':
python code/split_train_test.py 0.33 20180319
Positive size 2049, negative size 97951

The result of these steps is two DVC-files corresponding to the two commands, Posts-test.tsv.dvc and Posts.tsv.dvc. Also, a code/conf.pyc file was created. Files of this type should not be tracked by Git. Let’s manually add this file type to .gitignore.

$ git status -s
M data/.gitignore
?? Posts-test.tsv.dvc
?? Posts.tsv.dvc
?? code/conf.pyc
$ echo "*.pyc" >> .gitignore

Both of the steps can be committed to the repository together.

$ git add .
$ git commit -m 'Process to TSV and separate test and train'

Let’s run and commit the next steps of the pipeline. First, define the feature extraction step, which takes the train and test TSVs and generates the corresponding matrix files:

$ dvc run -d code/featurization.py -d code/conf.py \
-d data/Posts-train.tsv -d data/Posts-test.tsv \
-o data/matrix-train.p -o data/matrix-test.p \
python code/featurization.py

Using 'matrix-train.p.dvc' as a stage file
Reproducing 'matrix-train.p.dvc':
python code/featurization.py
The input data frame data/Posts-train.tsv size is (66999, 3)
The output matrix data/matrix-train.p size is (66999, 5002) and data type is float64
The input data frame data/Posts-test.tsv size is (33001, 3)
The output matrix data/matrix-test.p size is (33001, 5002) and data type is float64

Train a model from the train matrix file:

$ dvc run -d data/matrix-train.p -d code/train_model.py \
-d code/conf.py -o data/model.p \
python code/train_model.py 20180319

Using 'model.p.dvc' as a stage file
Reproducing 'model.p.dvc':
python code/train_model.py 20180319
Input matrix size (66999, 5002)
X matrix size (66999, 5000)
Y matrix size (66999,)

And evaluate the result using the trained model and the test feature matrix:

$ dvc run -d data/model.p -d data/matrix-test.p \
-d code/evaluate.py -d code/conf.py -o data/eval.txt \
-f Dvcfile \
python code/evaluate.py

Reproducing 'Dvcfile':
python code/evaluate.py

The model evaluation step is the last one. To make it the reproducibility goal by default, we name its DVC-file Dvcfile. This will be discussed in the next chapter in more detail.

The result of the last three run commands is three DVC-files and a modified .gitignore file. All the changes should be committed to Git.

$ git status -s
M data/.gitignore
?? Dvcfile
?? matrix-train.p.dvc
?? model.p.dvc
$ git add .
$ git commit -m Evaluate

The evaluation step output contains the target metric's value in a simple text form:

$ cat data/eval.txt
AUC: 0.624652

This is probably not the best AUC you have ever seen. In this document our focus is DVC, not ML modeling, and we use a relatively small dataset without any advanced ML techniques.

In the next chapter we will try to improve the metric by changing our modeling code and relying on reproducibility to regenerate the pipeline.

3. Reproducibility

3.1. How does reproducibility work?

The most exciting part of DVC is reproducibility.

Reproducibility is where you reap the benefits of DVC, rather than spending time defining ML pipelines.

DVC tracks all the dependencies, which helps you iterate on ML models faster without thinking about what was affected by your last change.

In order to track all the dependencies, DVC finds and reads ALL the DVC-files in a repository and builds a dependency graph (a DAG) based on these files.

This is one of the differences between DVC reproducibility and traditional Makefile-like build automation tools (Make, Maven, Ant, Rake etc). DVC was designed this way to localize the specification of DAG nodes.
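You can see this decentralized DAG specification in our repository — each stage file carries only its own piece of the graph, and DVC assembles the whole pipeline from them:

$ ls *.dvc Dvcfile
Dvcfile Posts-test.tsv.dvc Posts.tsv.dvc Posts.xml.dvc
matrix-train.p.dvc model.p.dvc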

If you run repro on any DVC-file in our repository, nothing happens because nothing in the defined pipeline has changed:

# Nothing to reproduce:
$ dvc repro model.p.dvc

By default, $ dvc repro reads the DVC-file named Dvcfile:

# Reproduce Dvcfile.
# But it is still nothing to reproduce:
$ dvc repro

3.2. Adding bigrams

Our NLP model was based on unigrams only. Let’s improve it by adding bigrams. The bigram model will extract signals not only from separate words but also from two-word combinations. This eventually increases the number of features for the model and, hopefully, improves the target metric.

Before editing the code/featurization.py file, please create and check out a new branch, bigram.

$ git checkout -b bigram
# Please use your favorite text editor:
$ vi code/featurization.py

Specify the ngram_range parameter in CountVectorizer (lines 50–53) and increase the number of features to 6000:

bag_of_words = CountVectorizer(stop_words='english',
                               max_features=6000,
                               ngram_range=(1, 2))

Reproduce the pipeline:

$ dvc repro
Reproducing 'matrix-train.p.dvc':
python code/featurization.py
The input data frame data/Posts-train.tsv size is (66999, 3)
The output matrix data/matrix-train.p size is (66999, 6002) and data type is float64
The input data frame data/Posts-test.tsv size is (33001, 3)
The output matrix data/matrix-test.p size is (33001, 6002) and data type is float64
Reproducing 'model.p.dvc':
python code/train_model.py 20180319
Input matrix size (66999, 6002)
X matrix size (66999, 6000)
Y matrix size (66999,)
Reproducing 'Dvcfile':
python code/evaluate.py

The process started from the feature extraction step because one of its dependencies changed — the edited source code file code/featurization.py. All dependent steps were re-executed as well.

Let’s take a look at the metric’s change. The improvement is close to zero (+0.0075% to be precise):

$ cat data/eval.txt
AUC: 0.624727

This is not a great result, but it still gives us some information about the model.

It is convenient to keep track of information even for failed experiments. Sometimes a failed hypothesis gives more information than a successful one.

Let’s keep the result in the repository. Later we can find out why bigrams do not add value to the current model and change that.

Several DVC-files were changed. This happened due to md5 checksum changes.

$ git status -s
M Dvcfile
M code/featurization.py
M matrix-train.p.dvc
M model.p.dvc
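
If you look inside one of the modified DVC-files, only the checksum lines differ — a sketch with placeholder values, your checksums will vary:

$ git diff model.p.dvc
-  md5: <old checksum>
+  md5: <new checksum>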

Now we can commit the changes:

$ git add .
$ git commit -m Bigrams

3.3. Checkout code and data files

The previous experiment changed the feature extraction step and provided no improvement. This might be related to imperfect model hyperparameters. Let’s try to improve the model by changing the hyperparameters.

There is no good reason to build on the bigram-based model. Let’s check out the original model from the master branch.

Note: after checking out code and DVC-files from Git, the data files have to be checked out as well, with the $ dvc checkout command.

$ git checkout master
$ dvc checkout
# Nothing to reproduce since code was checked out by `git checkout`
# and data files were checked out by `dvc checkout`
$ dvc repro

After a proper checkout there is nothing to reproduce, because all the correct files were checked out by Git and all the data files by DVC.

In more detail: $ git checkout master checked out the code and the DVC-files. The DVC-files from the master branch point to the old (unigram-based) data file outputs and dependencies. The $ dvc checkout command then found all the DVC-files and restored the data files based on them.

3.4. Tune the model

You should create a new branch for this new experiment. It will help you organize all the experiments in a repository and check them out when needed.

$ git checkout -b tuning
# Please use your favorite text editor:
$ vi code/train_model.py

Increase the number of trees in the forest to 700 with the n_estimators parameter of the RandomForestClassifier class, and increase the number of jobs (line 27):

clf = RandomForestClassifier(n_estimators=700,
                             n_jobs=6, random_state=seed)

Only the modeling and evaluation steps need to be reproduced. Just run repro:

$ dvc repro
Reproducing 'model.p.dvc':
python code/train_model.py 20180319
Input matrix size (66999, 5002)
X matrix size (66999, 5000)
Y matrix size (66999,)
Reproducing 'Dvcfile':
python code/evaluate.py

Validate the metric and commit all the changes.

$ cat data/eval.txt
AUC: 0.637561

This looks like a good model improvement: +1.28%. Please commit all the changes:

$ git add .
$ git commit -m '700 trees in the forest'

3.5. Merge the model to master

Now we can revisit the failed bigram hypothesis, which didn’t provide any model improvement even with a thousand more features. The current model with 700 trees in the forest is stronger, and we might be able to extract more signal with bigrams. So, let’s incorporate the bigram changes into the current model with a regular Git merge command.

Regular Git merge logic works for DVC-files and, through them, for data files and ML models.

But first, let’s create a branch as usual.

$ git checkout -b train_bigram
$ git merge bigram
Auto-merging model.p.dvc
CONFLICT (content): Merge conflict in model.p.dvc
Auto-merging Dvcfile
CONFLICT (content): Merge conflict in Dvcfile
Automatic merge failed; fix conflicts and then commit the result.

The merge has a few conflicts. All of the conflicts are caused by md5 checksum mismatches between the branches. You can resolve the conflicts properly by prioritizing the checksums from the bigram branch.
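For illustration, a conflicting fragment of model.p.dvc looks roughly like this (checksums replaced with placeholders):

deps:
<<<<<<< HEAD
- md5: <checksum from the tuning branch>
=======
- md5: <checksum from the bigram branch>
>>>>>>> bigram
  path: data/matrix-train.p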

Or you can simply replace all the conflicting checksums with the empty string ''.

The only disadvantage of the empty-string trick is that DVC will recompute the output checksums. After resolving the conflicts you need to check out the proper version of the data files:

# Replace conflicting checksums to empty string ''
$ vi model.p.dvc
$ vi Dvcfile
$ dvc checkout

And reproduce the result:

$ dvc repro
Reproducing 'model.p.dvc':
python code/train_model.py 20180319
Input matrix size (66999, 6002)
X matrix size (66999, 6000)
Y matrix size (66999,)
Reproducing 'Dvcfile':
python code/evaluate.py

The target metric:

$ cat data/eval.txt
AUC: 0.640389

The bigrams increased the target metric by 0.28%, and the last change looks like a reasonable improvement to the ML model. So, the result should be committed:

$ git add .
$ git commit -m 'Merge bigrams into the tuned model'

Now our current branch contains the best model, and it can be merged into master.

$ git checkout master
$ dvc checkout
$ git merge train_bigram
Updating f5ff48c..4bd09da
Fast-forward
Dvcfile | 6 +++---
code/featurization.py | 3 ++-
code/train_model.py | 2 +-
matrix-train.p.dvc | 6 +++---
model.p.dvc | 6 +++---
5 files changed, 12 insertions(+), 11 deletions(-)

The fast-forward strategy was applied to this merge. It means that all the changes are already in the right place and reproduction is not needed.

$ dvc checkout
# Nothing to reproduce:
$ dvc repro

4. Sharing data

4.1. Pushing data to cloud

It is pretty clear how code and DVC-files can be shared through Git repositories. These repositories contain all the information needed for reproducibility, so it might be a good idea to share them through GitHub or another Git service.

DVC is able to push the cache to cloud storage.

Using your shared cache, a colleague can reuse ML models that were trained on your machine.

First you need to modify the cloud settings in the DVC config file. This can be done programmatically:

$ dvc config core.cloud AWS
$ dvc config AWS.StoragePath dvc-share/classify
$ git status -s
M .dvc/config

Then, a simple command pushes files from your local cache to the cloud.

$ dvc push
(1/9): [##############################] 100% 23/404ed8212fc1ee6f5a81ff6f6df2ef
(2/9): [########## ] 34% 5f/42ecd9a121b4382cd6510534533ec3

The command does not push all the caches, only the caches for data files that belong to the current repository workspace.

For example, in this tutorial 16 data files were created and only 9 will be pushed, because the rest of the data files belong to other branches, like bigram.

4.2. Pulling data from cloud

In order to reuse your data files, a colleague of yours needs to pull the data the same way, from the master branch:

$ git clone https://github.com/dmpetrov/classify.git
$ dvc config AWS.StoragePath dvc-share/classify
$ dvc pull

After this command all the data files will be in the right place. You can check that by trying to reproduce the default goal:

# Nothing to reproduce:
$ dvc repro

5. DVC commands

The diagram below describes all the DVC commands and the relationships between the local cache and the cloud.

Summary

Git branches beautifully reflect the non-linear structure of the ML process, where each hypothesis can be represented as a Git branch. DVC makes it possible to navigate Git branches with both code and data, which makes the ML process more manageable and reproducible.
