
Creating reproducible data science workflows with DVC

Gleb Ivashkevich
Yandex School of Data Science
15 min read · Oct 2, 2019


The widespread adoption of machine learning is a recent phenomenon. As the field matures, the typical problem size grows, and while best practices are starting to emerge, with increased problem size come increased complexity and inherent structural problems: messy assets, poor reproducibility and fragmented (if any) experiment tracking.

Modern machine learning solutions require a huge number of data files and thousands of lines of code to process the data and create ML models. They also produce extensive metadata, combining information about how the data was processed with what artefacts or results were created.

Despite being implemented in software, development of data science and machine learning projects is dramatically different from general-purpose software development: for example, DS is primarily experiment-driven and has a much higher level of intrinsic unpredictability.

To illustrate the previous point: in a typical software development process, a team has a backlog of features to implement and goes through it based on some priorities. Although this process is not completely deterministic in terms of the resources used, it's quite clear that once a feature is selected for development, it will be implemented sooner or later.

Data science and machine learning are very different. Instead of features to implement, in ML we have ideas to try out, without any guarantee of succeeding. Before long, you may test a few dozen hypotheses, often substantially varying in their complexity and the implementation required. Most of them will fail, some may look promising, and ultimately, some complicated combinations of them may succeed.

Eventually, you or your colleague will try to reproduce one of those models, the one which seems to be the best. You may find out that you cannot easily reconstruct how exactly that model was created. It may turn out that the training parameters are buried somewhere in a notebook, overwritten by subsequent experiments and multiple Git commits in several branches. In addition, perhaps the training/cross-validation split was performed without setting a random seed. And if you’re still not worried enough, the cherry on top is that when examined, it turns out the current features look nothing like what they were when the model was created. Too bad.

However, there’s a solution. It may take some time to introduce and polish healthy processes and master the required tools, but you’ll quickly find them invaluable once you’ve done so.

Data science processes

We definitely need some method to eliminate the mess and confusion. There have been several attempts to outline the principles behind such a method, and some of them have led to consistent and reasonable sets of rules of thumb.

Some teams use variants of Cookiecutter Data Science, others follow the approach outlined in Guerrilla Analytics. Both address the same problems, both offer powerful tools, and we recommend trying them out.

Chances are, you or your team already use something similar. But regardless of which approach you use to write reproducible data science code, you need tooling.

The bare minimum requirements are the following:

  • a way to version-control the data, especially intermediate artefacts like pre-computed features and models,
  • a way to pass data files around the team in a controllable and trackable way,
  • a tool to reproduce any artefacts or results, in a simple and automated way, regardless of how long ago they were originally created.

In this tutorial we will explore how DVC implements all of the processes we've outlined and makes reproducible data science easier. DVC is open-source and attempts to be a Git for machine learning, while working closely with Git itself.

DVC is actually quite a large tool, but the most common operations are simple enough, in the same way that basic Git commands are easy to learn and incorporate into daily practice.

DVC is not the only tool for the job. It works best for small to medium-sized projects and solves the problem without adding too much complexity. However, depending on your needs, project size and deployment considerations, you may find Kedro or other tools more suitable. We will cover some of them in future tutorials.

Versioning the data

We start with the fundamental task of data version control. Let’s first define what data version control is:

  • the state of any data file, whether original or derived, must be recorded,
  • there must be a tool to switch between different versions of data files.

Consider the following scenario: the training data comes from a relational database and is stored as a CSV file. Once in a while, you want to update the dataset with recent records from the database. Each time you do so, you record the state of the dataset. If you have a way to switch to any of the previous versions and back — congratulations, your data is version controlled.

Git is not suitable for this, as it was not designed to serve large or binary files, while extensions like Git LFS are general-purpose and can be used for data version control only with some limitations and inconvenience. DVC offers a more flexible approach.

To illustrate this, we will use the Titanic dataset from Kaggle to build a simple model and a submission file. With this miniature data science project, we will see how DVC helps to ensure data lineage and reproducibility.

Install DVC

First, we need to install DVC. This can be done with pip:
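    pip install dvc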

or conda:
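    conda install -c conda-forge dvc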

Note that you can configure DVC to use external storage to hold and exchange data, and in that case you'll also need to install additional dependencies. For example, if you plan to use Amazon Web Services S3, you need to install boto and some other packages with:
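    pip install "dvc[s3]"   # the s3 extra pulls in boto3 and related packages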

Project layout

Our project starts with the Titanic dataset, which contains two files (one for training and one for testing), and a basic project structure.

It may be tempting to have all the data files and code in the same directory for a project this small.

However, it's strategically wiser to stick to the same project structure for any project, regardless of its size. A disciplined approach to project structure and operations saves a lot of headache and time down the road.

Let’s create a skeleton for our project:
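Something like this will do (the root directory name is arbitrary; we use titanic-dvc, which we will refer to later):

    mkdir titanic-dvc && cd titanic-dvc
    mkdir data features results pytitanic
    touch README.md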

The original data goes into the data directory. Although we have only two original data files, larger projects may have thousands or even millions of them, so it's reasonable to have a separate directory for the data.

All derived features and intermediate data files go to the features directory. Results (for example, trained models and submission files) will live in the results directory.

The pytitanic directory will contain the Python code for the project: scripts, modules, packages, etc. Additionally, you may have a notebooks directory, and directories for code in other languages (for example, R or Julia).

We will keep README.md empty for simplicity, although we still create it for the sake of the general procedure.

To finish the project setup, we need to add the data files and initialize the Git repository and DVC. Assuming you have already downloaded the data as a Zip archive into the data directory:
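(Kaggle's archive is usually named titanic.zip; adjust if yours differs.)

    unzip data/titanic.zip -d data
    git init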

You should see three new files (gender_submission.csv, test.csv and train.csv) in the data directory now. We will not use gender_submission.csv, so let's remove it along with the Zip archive:
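    rm data/gender_submission.csv data/titanic.zip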

Note that we do not put the data files under Git control, as their versioning will be handled by DVC. From now on, data files won't be managed by Git directly.

Managing the data with DVC

We are now ready to initialize DVC for our project. To do this, launch:
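    dvc init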

Several things happen when DVC performs initialization. First, it creates a .dvc directory to hold the files it needs for operation; .dvc is to DVC what .git is to Git.

Second, DVC instructs Git on how to handle the newly created files. If you look at the current Git status (with git status), you'll see that DVC has staged its files for commit:
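Abbreviated; the exact list of staged files depends on the DVC version:

    $ git status
    ...
    Changes to be committed:
            new file:   .dvc/.gitignore
            new file:   .dvc/config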

The .dvc/.gitignore file instructs Git to skip some of DVC's internal files inside .dvc, while .dvc/config contains the newly created DVC configuration, which is empty for now.

DVC tries to name commands in a familiar way. Most of the time, a DVC command does exactly what you would expect it to do based on your Git experience.

Moreover, DVC is a pretty verbose tool and most of the commands output meaningful and useful messages, so that you can understand what’s going on and what to do next.

Let’s commit the changes:
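    git commit -m "Initialize DVC"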

We are now ready to track the data files with DVC. To tell DVC about data/train.csv and data/test.csv we’ll use dvc add:
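    dvc add data/train.csv data/test.csv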

DVC verbosely reports what has happened:
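The output looks roughly like this (the exact wording varies between DVC versions):

    Saving 'data/train.csv' to cache '.dvc/cache'.
    Saving information to 'data/train.csv.dvc'.
    Saving 'data/test.csv' to cache '.dvc/cache'.
    Saving information to 'data/test.csv.dvc'.

    To track the changes with git run:

        git add data/.gitignore data/train.csv.dvc data/test.csv.dvc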

Let’s break this down. First, DVC creates yet another .gitignore file to exclude original data files from Git tracking.

Second, something more important happens: DVC puts data files in its cache, and creates two metafiles (data/train.csv.dvc and data/test.csv.dvc) with the information about original data files.

Metafiles follow the YAML standard and have a specific set of attributes (use cat data/train.csv.dvc to look into the file):
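With the checksums abbreviated, train.csv.dvc looks roughly like this (your file may contain a few more attributes, depending on the DVC version):

    md5: a3f2…
    outs:
    - cache: true
      md5: 61fd…
      metric: false
      path: train.csv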

The top-level md5 attribute contains the MD5 checksum of the *.dvc file contents, while the md5 attribute under outs contains the checksum of the data file itself.

You may notice that the md5sum utility calculates different values for both the data file and the *.dvc file. That's OK: DVC calculates MD5 on a transformed version of the file. For text files, it changes the EOL sequence from \r\n (which is the case for the Titanic dataset files) to \n.

For the *.dvc file itself, it's even more elaborate: the top-level md5 attribute contains the checksum not of the *.dvc file as stored on disk (think about why for a moment), but of a properly encoded string representation of its contents, with some filtering applied.

Now let's look at the cache. It's located by default in .dvc/cache:
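With the hashes abbreviated:

    $ tree .dvc/cache
    .dvc/cache
    ├── 61
    │   └── fd…
    └── e9
        └── 12…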

As you can see, DVC stores data files in the cache according to their MD5 checksum: the first two characters form the directory name, while the remaining ones are used as the cache file name.

We can now commit DVC metafiles to Git:
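    git add data/train.csv.dvc data/test.csv.dvc data/.gitignore
    git commit -m "Put Titanic data under DVC control"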

Note that Git knows nothing about the data files themselves: all the information needed to track them is stored in DVC files, while Git serves as an upper-level tool to track DVC itself.

Moving data around in a controllable way

As the data files are now under DVC control, we can start using it. For example, if you accidentally delete one of the data files, you can recreate it from the cache with dvc checkout:
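    rm data/train.csv
    dvc checkout data/train.csv.dvc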

dvc checkout looks into the target file (data/train.csv.dvc in this case) and retrieves the corresponding version from the cache. This is, of course, a simple example, but it illustrates the pattern.

A more elaborate example involves remote storage. DVC can store files outside the working directory, which makes it easy to share them using DVC tools. DVC allows using a local directory, AWS S3, Azure, and other destinations as remotes.

Let’s create local remote storage for the data files:
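The storage path here is illustrative; -d makes the new remote the default one:

    dvc remote add -d localremote /tmp/dvc-storage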

We can now push the data to the newly created remote:
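    dvc push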

Remotes are structured similarly to the local cache:
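    $ tree /tmp/dvc-storage
    /tmp/dvc-storage
    ├── 61
    │   └── fd…
    └── e9
        └── 12…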

By themselves, remotes are of limited usefulness. However, they become crucial when you work in a team. Let's illustrate this by creating a clone of our current repository:
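    cd ..
    git clone titanic-dvc titanic-dvc-copy
    cd titanic-dvc-copy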

The original data files are not there, but the DVC metafiles are, as they are tracked by Git. This allows us to easily fetch the data files from the existing remote storage:
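In the copy, we point DVC at the same storage path (using the local configuration, as discussed below) and pull:

    dvc remote add --local -d localremote /tmp/dvc-storage
    dvc pull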

Information about remotes is stored in DVC config files (.dvc/config). Let’s get back to the original repository and look at how the configuration file changed:
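Roughly (the section name depends on how you named the remote):

    $ cat .dvc/config
    ['remote "localremote"']
    url = /tmp/dvc-storage
    [core]
    remote = localremote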

We can now commit the changes to DVC config:
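    git add .dvc/config
    git commit -m "Configure DVC remote"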

Note that the newly created remote is local, and thus its configuration may be kept outside of Git. DVC allows the user to have several types of configuration, and the one called local is excluded from Git tracking. To use the local configuration when creating remotes, just add the --local option to the dvc remote add command. We added this option in the titanic-dvc-copy repository for illustration.

The same --local option should be used when creating cloud-backed remotes, as you'll need to add credentials to access AWS S3, Azure or GCP remotes, and it's not recommended to keep them in Git.

Data versioning

So far, we have only used DVC to add files to the cache and remote storage. This is only part of the story: more importantly, data files can be versioned. Of course, the Titanic dataset will not change, but real datasets change over time.

Imagine again the dataset we discussed at the beginning of the section. Over time, new records arrive in the database, and you want to update the dataset with recent data points. The obvious (and wrong) way to do this is to create a new dump under a different file name. However, this is wrong from both conceptual and organizational viewpoints.

First of all, conceptually it's the same dataset, just a new version of it. Adding new data points doesn't change the meaning or structure of the data itself. Second, you soon won't be able to track all the versions by file names alone, and neither you nor your team will be able to easily reproduce previous results. Data versioning is a much better approach.

We will simulate changes in the data by simply renaming the Name column to FullName in both data files. This is enough for our purposes: DVC doesn't care what actually changed, it just tracks the changes.

For convenience, let's tag the latest Git commit so that we can easily check out files from it without messing with hashes:
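    git tag base-dataset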

We can now add edited data files to DVC:
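    dvc add data/train.csv data/test.csv
    git add data/train.csv.dvc data/test.csv.dvc
    git commit -m "Rename Name column to FullName"
    dvc push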

At this point, the data files in the working directory contain the renamed column, DVC has added them to the cache (check this with tree .dvc/cache), and we have pushed them to the local remote. So far, all the changes have been propagated to all locations.

Let's assume that you want to get the original version of the data back. We intentionally tagged the corresponding commit, and can now easily check out the DVC metafiles for that version:
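    git checkout base-dataset -- data/train.csv.dvc data/test.csv.dvc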

With the metafiles for base-dataset, we can easily get the original version of the data files from the cache or remote storage:
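    dvc checkout data/train.csv.dvc data/test.csv.dvc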

If you check out the data files, you will see that the column is named Name, as it was in the original files. To revert the data files to the current form, with FullName instead of Name, just reset the corresponding metafiles to Git `HEAD` and perform dvc checkout again.

Note that with DVC, we effectively version control the metafiles; all the tracking of the actual data files is performed by DVC based on the information in them.

As you can see, DVC is convenient and simple enough to be used for data versioning. It lets you easily record the state of the data and switch between different versions (with some help from Git).

This reduces the mess and helps to keep data coherent across teammates and locations. However, DVC can do more: it can track calculations and results, allowing you to recreate any previous result without much trouble.

Managing calculations with DVC

DVC has two main concepts for reproducible calculations: stages and pipelines. Let's start with the simpler one and create a DVC stage that calculates some features. We will not go too far right now: we will just make some columns categorical in pytitanic/features.py.

Now, we add the feature calculation code to features.py.
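A minimal sketch of such a script (the command-line interface and the exact set of categorical columns are illustrative):

    # pytitanic/features.py: turn selected columns into categoricals
    import argparse

    import pandas as pd

    # the set of columns to treat as categorical is illustrative
    CATEGORICAL = ["Pclass", "Sex", "Embarked"]


    def build_features(df):
        """Return a copy of df with the selected columns converted to categorical."""
        df = df.copy()
        for col in CATEGORICAL:
            df[col] = df[col].astype("category")
        return df


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--train", default="data/train.csv")
        parser.add_argument("--test", default="data/test.csv")
        parser.add_argument("--out-train", default="features/train_features.csv")
        parser.add_argument("--out-test", default="features/test_features.csv")
        args = parser.parse_args()

        build_features(pd.read_csv(args.train)).to_csv(args.out_train, index=False)
        build_features(pd.read_csv(args.test)).to_csv(args.out_test, index=False)


    if __name__ == "__main__":
        main()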

The code is simple and self-explanatory, so we will not go through it. In a typical environment, you would immediately launch python -m pytitanic.features …, but with DVC it works a bit differently.

First, let's add the newly created Python files to Git:
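    git add pytitanic/features.py
    git commit -m "Add feature calculation code"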

Now, let’s create a DVC stage:
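The dependencies and outputs below match our project layout; the script is run with its default arguments:

    dvc run -f features/features.dvc \
            -d data/train.csv -d data/test.csv -d pytitanic/features.py \
            -o features/train_features.csv -o features/test_features.csv \
            python -m pytitanic.features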

Let's unpack this. First, we instruct DVC to run a command and create a stage (and a corresponding stage file) with dvc run -f features/features.dvc. A stage in DVC is just a chunk of tracked computation. All the information about how the stage is performed is stored in the stage file (features/features.dvc in this case).

To tell DVC about dependencies, we use the -d option and list all the files needed for the computation. With -o, we declare the outputs of the stage. The final part is the command itself.

When we launch the command above, DVC creates the stage file, which contains all the information needed to recreate the stage results at any time in the future:
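Again with checksums abbreviated, features/features.dvc looks roughly like this:

    cmd: python -m pytitanic.features
    deps:
    - md5: 61fd…
      path: data/train.csv
    - md5: e912…
      path: data/test.csv
    - md5: 0b7c…
      path: pytitanic/features.py
    md5: 9a1d…
    outs:
    - cache: true
      md5: 77aa…
      metric: false
      path: features/train_features.csv
    - cache: true
      md5: 3c9e…
      metric: false
      path: features/test_features.csv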

Note that DVC adds the output files to the cache, so you do not need to do this manually. However, we need to commit the stage file and the .gitignore created by DVC in the features directory (again, the .gitignore excludes output files from Git tracking) and create a tag for convenience:
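The tag name here is illustrative:

    git add features/features.dvc features/.gitignore
    git commit -m "Add features stage"
    git tag features-v1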

You may ask: what's the point? Couldn't we just put the output files under DVC control manually, as we did before? Yes, we could, but now we not only have them in the cache (and so can push them to any remote to share with others), but can also easily recreate the calculation with updated code.

Creating pipelines

Calculations, however, may be more complex than a single stage. In that case, we want some modularity instead of a single monolithic computation.

DVC has tools for this too. To illustrate them, we will now proceed to the actual machine learning and create a simple CatBoost model with the features we have:
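Below is a condensed sketch of such a training script (say, pytitanic/train.py); the column handling, parameter names and defaults are illustrative:

    # pytitanic/train.py: train a CatBoost model and write all results to disk
    import argparse
    import json

    import pandas as pd
    from catboost import CatBoostClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    CAT_COLS = ["Pclass", "Sex", "Embarked"]
    NUM_COLS = ["Age", "SibSp", "Parch", "Fare"]


    def prepare(df, medians):
        """Select feature columns and impute missing values with training-set medians."""
        X = df[CAT_COLS + NUM_COLS].copy()
        X[NUM_COLS] = X[NUM_COLS].fillna(medians)
        X[CAT_COLS] = X[CAT_COLS].fillna("NA").astype(str)
        return X


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--iterations", type=int, default=500)
        parser.add_argument("--learning-rate", type=float, default=0.05)
        # random state is explicit and overridable, so we never lose track of it
        parser.add_argument("--random-state", type=int, default=17)
        args = parser.parse_args()

        train = pd.read_csv("features/train_features.csv")
        test = pd.read_csv("features/test_features.csv")

        medians = train[NUM_COLS].median()  # training-set statistics only
        X, y = prepare(train, medians), train["Survived"]

        # stratified train/validation split on Pclass
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.2, stratify=X["Pclass"],
            random_state=args.random_state)

        model = CatBoostClassifier(iterations=args.iterations,
                                   learning_rate=args.learning_rate,
                                   random_seed=args.random_state, verbose=False)
        model.fit(X_tr, y_tr, cat_features=CAT_COLS, eval_set=(X_val, y_val))
        model.save_model("results/model.cbm")

        # submission file for Kaggle
        preds = model.predict(prepare(test, medians)).astype(int)
        pd.DataFrame({"PassengerId": test["PassengerId"],
                      "Survived": preds}).to_csv("results/submission.csv", index=False)

        # metrics file, consumed later by `dvc metrics show`
        accuracy = float(accuracy_score(y_val, model.predict(X_val)))
        with open("results/metrics.json", "w") as f:
            json.dump({"accuracy": accuracy}, f)


    if __name__ == "__main__":
        main()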

This file looks large, but it's very straightforward. First, we create a stratified random split. We can stratify on any categorical column, but let's use Pclass as the default (think for a moment: why are we doing this?). We then perform some missing-value imputation using the training set and, finally, we train the model. All the results are stored on disk.

Several things to note:

  • parameters are explicit and captured on the command line. This allows us to have reproducible commands without any implicit defaults,
  • the script creates what is called a metrics file alongside the model file and the submission. We will see later how useful this is in combination with DVC,
  • the random state is explicit and can be provided as a command-line parameter with a default value. We won't lose track of it.

We are now ready to train the model using dvc run:
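A sketch of the command; -M marks results/metrics.json as a metrics file, and the parameter values are illustrative:

    dvc run -f results/model.dvc \
            -d features/train_features.csv -d features/test_features.csv \
            -d pytitanic/train.py \
            -o results/model.cbm -o results/submission.csv \
            -M results/metrics.json \
            python -m pytitanic.train --iterations 500 --learning-rate 0.05 --random-state 17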

What we have actually created is a DVC pipeline: to generate the required output, DVC checks whether all the dependencies are in place, since features/train_features.csv and features/test_features.csv are themselves the results of another stage (features/features.dvc). Now that we have defined the stage for results/model.dvc, we can use it with dvc repro. Additionally, you can inspect the pipeline:
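    dvc pipeline show --ascii results/model.dvc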

When DVC tries to reproduce results/model.dvc, it first constructs a dependency graph and decides whether any of the dependencies must be reproduced. Right now they are all up to date, and DVC does not perform any calculations:
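The messages vary between DVC versions, but you will see something along these lines:

    $ dvc repro results/model.dvc
    Stage 'features/features.dvc' didn't change.
    Stage 'results/model.dvc' didn't change.
    Pipeline is up to date. Nothing to reproduce.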

However, if we remove one or both of the features files, DVC will recognize that and reproduce them first:
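    rm features/train_features.csv features/test_features.csv
    dvc repro results/model.dvc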

You may note the somewhat unfortunate name of the dvc repro command. It does not reproduce the target in the scientific sense, but rather recalculates it. To reproduce one of the previous versions, you should either check it out from the DVC cache, or check out the corresponding Git commit and actually recalculate it.

As you can see, DVC goes back through all the stages defined so far and recognizes that some of the intermediate results needed for results/model.dvc are missing. It reproduces the features/features.dvc stage, but correctly determines that the regenerated files are identical to the cached ones, so there is no need either to save them to the cache or to reproduce results/model.dvc.

Another ingredient is the metrics file. DVC can track metrics alongside other outputs. This allows us to recall the performance of the model later:
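    dvc metrics show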

Moreover, DVC can fetch specific metrics directly:
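For a JSON metrics file, the path is a simple key lookup:

    dvc metrics show -x accuracy results/metrics.json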

In this case, DVC recognizes that the metrics file is in JSON format and uses the path (-x or --xpath) to find the actual field inside the file.

We can now commit the newly created stage:
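    git add results/model.dvc results/.gitignore
    git commit -m "Add model training stage"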

Running experiments

With stages, pipelines and metrics files, DVC enables even more flexible workflows. Consider this: after you've created the basic features and your first model, you want to add some additional features and retrain the model. You then want to determine whether the new features are better. How would you achieve this?

The workflow may be as follows:

  • create a new Git branch,
  • add new code to calculate additional features,
  • reproduce the pipeline for results/model.dvc,
  • compare the metrics for the model trained on the initial features with the one trained on the new features.

Let’s try this out. First, we create a branch:
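The branch name is ours to choose:

    git checkout -b pclass-sex-feature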

Now, we add a new feature (the well-known PclassSex) in pytitanic/features.py:
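Inside the feature-building function from the sketch above, the addition may look like this:

    # a simple interaction feature combining passenger class and sex
    df["PclassSex"] = df["Pclass"].astype(str) + "_" + df["Sex"].astype(str)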

We now need to reproduce results/model.dvc:
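    dvc repro results/model.dvc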

DVC will report that, since one of the dependencies changed, the stage must be recreated, and will then launch the calculation. Both features/features.dvc and results/model.dvc will be updated, and we can commit them now:
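    git add features/features.dvc results/model.dvc
    git commit -m "Add PclassSex feature and retrain the model"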

We can now launch dvc metrics show and see how the newly created model compares with the previous one:
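The metric values below are placeholders:

    $ dvc metrics show -a
    pclass-sex-feature:
        results/metrics.json: {"accuracy": …}
    master:
        results/metrics.json: {"accuracy": …}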

dvc metrics show has an option to show all available metrics across all branches, namely -a. This helps us compare metrics either between different models inside a branch or between different versions of the same model. Here we can see that the new model is slightly better in terms of accuracy.

We may now decide to merge this branch back to master or submit both submission files to Kaggle to compare leaderboard metrics.

Moving forward

DVC is a powerful tool, and we have covered only its fundamentals. DVC can do more: it can be configured to use links between the working directory and the cache to save space, can use any of the three main cloud providers for remote storage, and can even install Git hooks.

However, similar to Git, DVC is easy to introduce into daily practice with a set of simple rules:

  • once you start a project, add each original data file to DVC with dvc add,
  • create artefacts (intermediate data files, models, etc.) only through DVC stages or pipelines,
  • commit corresponding *.dvc metafiles to Git,
  • record metrics when running calculations or training with dvc run,
  • compare different experiments with dvc metrics show,
  • remember that DVC does not assign any special meaning to metrics, so you can store any important information as metrics; for example, you may want to record the running time of a calculation,
  • to move through history and reproduce earlier results, use a combination of git checkout, dvc checkout (with optional dvc pull) and dvc repro,
  • share your data files through convenient remote storage with dvc push.

Good practices become more important as the industry matures. Collaboration patterns are getting more advanced, and projects are becoming larger and more elaborate. We hope DVC will help you keep your projects in order and simplify your day-to-day operations as a data scientist.
