Fundamentals of MLOps — Part 2 | Data & Model Management with DVC

Tezan Sahu
Published in Analytics Vidhya
18 min read · Sep 5, 2021

In Part 1 of this 4-blog series, we understood the motivation behind adopting MLOps practices & also looked at some of the basic principles of MLOps. We also had a glance at the arsenal of tools & frameworks used to implement MLOps in various stages of an ML workflow lifecycle. From this blog onwards, we will dive deeper into some of these tools & learn how to actually use them in our own projects for better reproducibility & collaboration.

Contents

  • Versioning in ML Projects
  • What Is Data Version Control (DVC)?
  • Exploring the Basic Functionalities of DVC
  • Initialization
  • Tracking a File using DVC
  • Storing Data Remotely
  • Retrieving Remotely Stored Data
  • Making Modifications to the Data
  • Switching between Versions of Data
  • Accessing Data Tracked by DVC
  • DVC Python API
  • Using DVC on a Shared Server
  • Concluding Remarks
  • Additional References

Versioning in ML Projects

Software development includes the continuous process of documenting & maintaining source code, modifying them & collaborating with other developers; and a version control system (like Git & Mercurial) makes this task easier. Similarly, in ML & Data Science, creating & deploying ML models to production is an iterative process requiring extensive versioning for maintenance & reproducibility. Such projects are a stark contrast to the usual software development projects because ML requires both code & data — thus, versioning here is not restricted to the code, but also involves the versioning of data, hyperparameters & other metadata as well.

We are all familiar with Git, a well-known & useful version control tool used widely in software development. So, if it is the holy grail of version control, why is it not advisable (or efficient) to use only Git for version-controlling ML projects?

  • ML/DL projects usually involve tons of data, running into hundreds of MBs (& sometimes GBs or even TBs), while GitHub caps the size of any individual file pushed to a repo at 100 MB. Solutions like git-lfs (Git Large File Storage) work around such limits, but they are not flexible enough.
  • During collaboration among people in an organization, it may be required that access to data (or some other portion of the pipeline) is restricted only to a certain set of people. If the data is tracked directly along with other code in a GitHub repo, it may be difficult to provide such varying levels of access.
  • Moreover, actual data for training a model may already be present remotely in some storage service (like Amazon S3, Azure Blob Storage, etc.) & it would be wasteful to copy over all that data into a GitHub repo & track it separately.

All this calls for some specific tools & frameworks to address this issue of versioning ML artifacts like data, hyperparameters & models that work along with existing version control tools to provide a seamless experience for ML project management. Enter Data Version Control!

Why Do Data & Models Change?

Before diving into Data Version Control, we shall first try to enlist some of the reasons as to why data & models change in an ML system:

  • Data can be spread across several platforms
  • Data may be frequently changing, requiring periodic updates to our model to perform well (think of click-through rate data for building recommender systems)
  • Models can be retrained with this new data & perhaps newer training techniques
  • Models may degrade over time
  • Underperforming models can be rapidly reverted to an earlier serving version

What Is Data Version Control (DVC)?

Data version control (the concept, not the specific tool!) is a collection of tools and procedures that attempt to adapt the version control process to the world of data. One such tool that helps data scientists govern their data and models and run reproducible experiments is DVC or Data Version Control.

DVC, developed by Iterative.AI, is an open-source command-line tool written in Python for data science & ML project management & versioning. To take advantage of the existing toolset that most developers are familiar with, it emulates Git commands and workflows so that users can rapidly integrate it into their usual Git practice.

DVC uses a remote repository (with support for all major cloud providers) to store all the data and models of a project. The actual code repository only stores a pointer to this remote location, which is used to access the actual artifacts.

Features of DVC

Using DVC brings agility, reproducibility, and collaboration into your existing data science workflow. Some of the core features of DVC are:

  • Git-compatible: It runs on top of any Git repository and is compatible with any standard Git server or provider (GitHub, GitLab, etc.)
  • Simplistic Data Versioning: Several versions of data & models are maintained by replacing large files, dataset directories, ML models, etc. with small metafiles (as placeholders) containing pointers to the original data
  • Storage-agnostic: It can use Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or disc as a remote to store data
  • Reproducible: It makes ML projects reproducible by creating lightweight pipelines using implicit dependency graphs, and by codifying the data and artifacts involved
  • Language- & framework-agnostic: It is independent of the programming languages (Python, R, Julia, shell scripts, and so on) or machine learning libraries (Keras, Tensorflow, PyTorch, Scipy, and so on) used in the project
  • Low friction branching: It supports instantaneous Git branching, even with large files & also prevents data duplication across experiments
  • Easy to use: It is quick to install and doesn’t require special infrastructure

Note: DVC does NOT replace Git! The metafiles that hold references to the original data and models in the remote change as the data changes over the course of the project, & they themselves need to be version controlled using Git.

Comparing DVC to other Solutions

Apart from DVC, there are several other top tools for data version control that can help automate work and optimize processes. Here’s a quick comparison between DVC & some of the other tools:

Image Source: Comparing Data Version Control Tools

Exploring the Basic Functionalities of DVC

With the basic overview of DVC, let’s dive straight into using DVC for our own ML project to understand the various functionalities that it offers. In this tutorial, we will learn how to:

  • Install DVC & initialize a project
  • Add files to a remote repository (Amazon S3) & retrieve them later
  • Switch between different versions
  • Search & import files from a DVC repo

Later, we will also look into using DVC as a Python package.

Installation

DVC can be installed from its repository or binary package based on the user’s OS, the steps for which can be found in the docs.

However, since DVC can also be used as a Python library (we will explore this later), it can simply be installed using a package manager like pip or conda. Depending on the type of remote storage being used in the project, optional dependencies may have to be installed.

When using such package managers, it is advisable to create & activate a virtual environment before installing the package.
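For example, a minimal sketch using Python's built-in venv module (the environment name .venv here is just an illustrative choice):

# Create & activate a virtual environment (illustrative name: .venv)
$ python -m venv .venv
$ source .venv/bin/activate       # On Windows: .venv\Scripts\activate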

Based on our preferred package manager, we will install DVC (with Amazon S3 remote) as follows:

$ pip install "dvc[s3]"                    # Using pip
$ conda install -c conda-forge dvc-s3      # Using conda

In this case, it installs the boto3 library along with DVC.

Feel free to check out the DVC documentation to install dependencies for other remotes that you may want to use.

Now, you can type dvc --help to view all the commands that can be used with DVC.

Initialization

After installing DVC, we create a project directory & initialize it as a Git repo. Create a corresponding GitHub repo & set it as the origin remote for the Git repo. Later, we use dvc init to initialize it as a DVC repo. This is our workspace.

$ mkdir mlops_dvc
$ cd mlops_dvc
$ git init
$ git remote add origin <github-repo-link>
$ git branch -M main
$ dvc init

This will create an internal .dvc folder with the structure mentioned below, that holds internal directories and files needed for DVC operation.

mlops_dvc
├── .dvc
│   ├── .gitignore
│   ├── config
│   ├── plots
│   │   ├── confusion.json
│   │   ├── confusion_normalized.json
│   │   ├── default.json
│   │   ├── linear.json
│   │   ├── scatter.json
│   │   └── smooth.json
│   └── tmp
│       ├── links
│       │   └── cache.db
│       └── md5s
│           └── cache.db
├── .dvcignore
└── .git
  • config: Configuration file, edited using dvc config
  • plots/ : Directory for plot templates
  • tmp/ : Directory for miscellaneous temporary files
  • links/ : SQLite DB storing file links from cache to workspace
  • md5s/ : SQLite database storing MD5 hashes for tracked files
  • .dvcignore : Files and directories excluded when traversing a DVC project

These include the configuration files, local cache, plot templates & temporary files (more information). It is similar to the .git/ folder created on the initialization of a Git repo. The dvc init command also creates a .dvcignore file (similar to .gitignore) to contain a list of paths for DVC to ignore.

Now, we commit these internal files to the Git repo using git commit -m "Initialize DVC in repo".

From here on, we will create and manage different DVC files and populate the cache as we use DVC and work on our ML project.

Tracking a File using DVC

Once our DVC repo is initialized, we can start adding data & code to our Git repo, as required by our ML project. It is advisable to maintain a clean directory structure for your ML project that separates the data, code, models, metrics, etc. Following is a very generic directory structure that can be adopted for most ML projects:

project_dir
├── data/         # Directory with raw and intermediate data
│   └── prepared/     # Processed dataset
├── metrics/      # Plots & logs with performance metrics
├── models/       # Trained models
└── src/          # Files for processing, training, evaluation

The data/ directory usually holds training data in the form of files (like XML, CSV, TSV, etc.) or folders (like images belonging to various classes), which may be of huge sizes. For this tutorial, we will download this toy dataset (data.csv) & place it in the mlops_dvc/data/ folder. Now, type the following command in the terminal to start tracking the data.csv file:

$ dvc add data/data.csv     # To start tracking a file or directory

With this, DVC saves information about the newly added files (or directories) in a special .dvc file named data/data.csv.dvc (a tiny text file in a readable format). This metadata file serves as a placeholder for the actual data and can be readily versioned using Git, much like source code. Its contents are as follows:

outs:
- md5: 97edcee3524b1675ab68790ad35e80e9
  size: 158375420
  path: data.csv
  isexec: true

In the background, dvc add moves the data to the project's cache and links it back to our workspace. You can verify that the .dvc/cache/ folder contains a subfolder 97, holding a file named edcee3524b1675ab68790ad35e80e9. This cache path is determined by the MD5 hash (97edcee...) of the data.csv file we just added.
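As a quick sanity check (assuming a Linux/macOS shell where md5sum is available), you can compare the file's hash against the cache layout:

$ md5sum data/data.csv     # should print 97edcee3524b1675ab68790ad35e80e9
$ ls .dvc/cache/97/        # should list a file named edcee3524b1675ab68790ad35e80e9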

Now, we track the data.csv.dvc file using Git, while ignoring the actual data.csv file.

$ git add data/data.csv.dvc data/.gitignore  # Stage the metafile & the .gitignore (created by dvc add) that excludes data.csv from Git tracking
$ git commit -m "Add raw data"               # Commit the data/data.csv.dvc for Git tracking

Storing Data Remotely

Once data is tracked using DVC, it can be safely stored remotely (DVC supports several remote storage options, as discussed previously).

Setting up the Amazon S3 Bucket

Since we plan to store the data in Amazon S3, we will first have to set up our S3 bucket that will store our data as follows:

Step 1: Log into your AWS Console

Step 2: Search for “s3” in the search box & select the S3 service

Step 3: Click on “Create Bucket” in the S3 Dashboard

Step 4: Enter a Bucket Name & select the “Asia Pacific (Mumbai) ap-south-1” AWS Region. For example, I have set the bucket name as mlopsdvc170100035. (This is because S3 bucket names must be unique within a partition - a group of regions) Leave all other settings default & hit the "Create Bucket" button.

Step 5: Now we need to create our Access Key that will enable DVC (on our local system) to communicate with the Bucket. For this, expand the menu under your profile name & click “My Security Credentials”.

Step 6: Click the “Access keys (access key ID and secret access key)” & then the “Create New Access Key” button. This will create a unique access key pair for you.

You should download the key file & keep it safe because this is the only time that you will be shown your secret access key.

Step 7: Now, create 2 environment variables named AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY containing your access key ID & secret access key respectively; these will be picked up by the boto3 library when we upload/download data from our S3 bucket. Also, create & set the AWS_DEFAULT_REGION variable to ap-south-1.
Reference for Windows Users
Reference for Linux Users
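On Linux/macOS, for instance, these variables could be set in the shell as follows (the values below are placeholders for your own keys):

$ export AWS_ACCESS_KEY_ID=<your-access-key-id>
$ export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
$ export AWS_DEFAULT_REGION=ap-south-1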

With this, we have set up our AWS Credentials & Amazon S3 bucket to store our data remotely using DVC.

Pushing Data to Remote Storage

To store the DVC-tracked data & model files remotely, we first need to configure the remote storage location as follows:

$ dvc remote add -d storage s3://mlopsdvc170100035/dvcstore  
# Enter your respective bucket name
$ git add .dvc/config
$ git commit -m "Configure remote storage"

The first command creates a remote named storage & links it to a (newly created) folder dvcstore in our S3 bucket. The -d flag ensures that this remote is the default one used for uploading & downloading data.
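After these commands, the .dvc/config file committed above should contain a section roughly like the following (the bucket name & folder will match whatever you passed to dvc remote add):

[core]
    remote = storage
['remote "storage"']
    url = s3://mlopsdvc170100035/dvcstore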

Next, we push our data into the remote storage (dvc push copies the data cached locally to the remote storage) & later push to our GitHub repo:

$ dvc push
$ git push origin main

After completing the above steps, you can check your Amazon S3 console to find that the data has been pushed into your bucket.

On inspecting your GitHub repo, you should notice that the data/ folder contains only the .gitignore & data.csv.dvc files, whereas the actual data.csv file isn't stored on GitHub.

Retrieving Remotely Stored Data

DVC-tracked data & models can be retrieved into other copies of this project using dvc pull whenever needed. It is usually run after git clone or git pull.

To simulate this situation in this tutorial, we delete the .dvc/cache/ folder & the data/data.csv file, and then pull from the DVC remote.

$ rm -rf .dvc/cache
$ rm -f data/data.csv
$ dvc pull

We can easily verify that the data.csv file (of ~150MB) has been downloaded from our remote to our workspace.

Making Modifications to the Data

In real-world ML projects, we may use multiple versions of the data & models across the various experiments. Some potential causes of this may be:

  • Different pre-processing techniques being used in different experiments
  • The inflow of new data into the system from time to time from some external source
  • Newer models trained with different hyperparameters

Under such circumstances, DVC helps track these changes effectively. To simulate such a situation, we double our original dataset as follows:

# Copy the contents of data.csv to a temporary location
$ cp data/data.csv /tmp/data.csv
# Append contents from this location to the end of data.csv
$ cat /tmp/data.csv >> data/data.csv

We can verify this modification by noting that the size of this new data.csv file is twice that of the old file. Now, we are ready to track this change & upload the new file to our S3 bucket:

$ dvc add data/data.csv
$ git add data/data.csv.dvc
$ git commit -m "Update dataset"
$ dvc push

It should be evident that pushing this new file takes almost twice as long as the previous push due to the increase in the size of the file.

You can verify the push by checking that the MD5 hash in the updated data/data.csv.dvc file (7663219...) corresponds to the file path in the S3 bucket.
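If you happen to have the AWS CLI configured, a quick cross-check (purely illustrative) is to print the metafile & list the matching prefix in the bucket, where the prefix is the first two characters of the new hash:

$ cat data/data.csv.dvc
$ aws s3 ls s3://mlopsdvc170100035/dvcstore/76/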

Now you can push to GitHub as well to track the new changes.

Switching between Versions of Data

Once we have several versions of our data & models, it is obvious that we may need to switch between these versions time & again. Just as Git has the checkout command to switch between different versions of our code, DVC offers the checkout command to synchronize the data & models according to the .dvc files. The usual practice is to run git checkout <...> (to switch a branch or checkout a .dvc file version), followed by dvc checkout.

To get to the previous version of our dataset & commit, we perform the following:

$ git checkout HEAD~1 data/data.csv.dvc
$ dvc checkout
$ git commit data/data.csv.dvc -m "Revert updates"

Note that we don’t need to dvc push again because this version of the dataset is already stored in our Amazon S3 bucket. Thus, we can say that DVC isn't even a version control system in the strictest sense! Data file versions are defined by the contents of the .dvc file, while version control is provided through Git. DVC then produces these .dvc files, changes them, and efficiently synchronizes the DVC-tracked data in the workspace to match them.
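More generally, the Git history of the .dvc metafile tells you which data versions exist, & any of them can be restored with the same checkout pattern. A sketch (the commit hash below is a placeholder):

# List the commits that modified the metafile
$ git log --oneline -- data/data.csv.dvc
# Restore the metafile from a chosen commit & sync the data to match it
$ git checkout <commit-hash> data/data.csv.dvc
$ dvc checkout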

Accessing Data Tracked by DVC

Once we can track our data & models effectively, the very next thought that arises is how can we access these tracked artifacts to reuse the datasets or deploy a model? In this section, we will discuss the way we can look for relevant files in a DVC-tracked repository & download data from it.

Searching for Files/Directories using DVC

To download data & models, we first need to find where they are located & if they are accessible. To explore a DVC repository hosted on any Git server, the command dvc list can be used. It lists the project contents, including files, models, and directories tracked by both DVC & Git. The syntax for this command is as follows: dvc list <repo-url> [<file-path>] ([...] implies that the <file-path> is optional)

As an example, running dvc list (in some other directory) on your respective GitHub repo that you have been using for this tutorial yields:

$ dvc list https://github.com/<your-username>/<repo-name>
.dvcignore
data
$ dvc list https://github.com/<your-username>/<repo-name> data
.gitignore
data.csv
data.csv.dvc

Downloading Relevant Files & Directories

Both dvc get & dvc import can be used to download files or directories from a DVC-tracked repository. However, dvc get downloads the files without maintaining any of the metadata that connects it to the project, whereas dvc import also creates the corresponding .dvc files to be saved in the project.

dvc import is essentially a combination of dvc get & dvc add applied together.

The syntax of these commands is as follows: dvc get/import <repo-url> <file-path> -o <output-file-path>

The data.csv file stored in our S3 & tracked by the data/data.csv.dvc file can be downloaded into any project as follows:

$ dvc get https://github.com/<your-username>/<repo-name> data/data.csv -o data_downloaded/data.csv
$ dvc import https://github.com/<your-username>/<repo-name> data/data.csv -o data_downloaded/data.csv

Note: dvc get can be called from anywhere, but dvc import must be called from a DVC-tracked repository.

DVC Python API

As mentioned previously, DVC can also be used as a Python library by importing it like any other Python module using import dvc.api. The goal of this API is to give developers programmatic access to the data & models versioned in DVC-tracked repositories. Two of the library functions that we will go through are:

get_url()

Given the DVC-tracked GitHub repository URL (repo) & path to file (path), it returns the URL string of the storage location where the actual file is stored in the DVC remote.

Example:

import dvc.api

x = dvc.api.get_url(repo="https://github.com/<your-username>/<repo-name>", path="data/data.csv")
# x stores the URL of the corresponding file in your S3 bucket
# (like s3://mlopsdvc<your-roll-number>/dvcstore/<path-to-file>)

open()

Given the DVC-tracked GitHub repository URL (repo), the path to file (path) & mode in which the file is opened (mode), it opens the data/model file & generates the corresponding file object.

Example:

import dvc.api
import pandas

with dvc.api.open(
    repo="https://github.com/<your-username>/<repo-name>",
    path="data/data.csv",
    mode="r"
) as fd:
    df = pandas.read_csv(fd)

Using DVC on a Shared Server

In many academic & professional contexts, individual laptops are not used for computationally intensive work, since they are not powerful enough to handle huge volumes of data or intense processing. Teams instead rely on the cloud or on-premises workstations, where multiple people frequently collaborate on the same development server. Although this allows better resource utilization, such as GPU access & centralized data storage, there is a high chance of the same data being duplicated unnecessarily across several users & repositories if it is not managed correctly.

Given this setting, DVC can play a crucial role in maintaining a shared cache on the development server such that all users can have near-instantaneous workspace restoration and switching speeds simply using the dvc checkout command.

After a DVC repository is initialized with dvc init, DVC by default places the cache in the repository's .dvc/cache folder. To create & use a shared cache on the server, a directory for the external cache must be created & configured with appropriate permissions & linking.


Given below is a step-by-step example of setting up a shared DVC cache on a server & using it in a DVC-tracked repository:

# Create a folder to behave as external cache
$ mkdir -p /home/shared/dvc-cache
# Move cache from current repo to external cache (optional)
$ mv .dvc/cache/* /home/shared/dvc-cache
# Set cache of DVC-tracked repository as folder made above
$ dvc cache dir /home/shared/dvc-cache
# Set group permissions on new cache files
$ dvc config cache.shared group
# Enable symlinks to avoid copies from cache to workspace
$ dvc config cache.type symlink

On inspecting the .dvc/config file in the repository, the following section appears:

[cache]
    dir = /home/shared/dvc-cache
    shared = group
    type = symlink

The cache.type could also be reflink, hardlink, or copies. Once the cache.type is changed, DVC needs to be told about it explicitly using dvc checkout --relink for the change to take effect.
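For instance, switching to hard links & relinking the workspace would look roughly like this (hardlink is just one of the possible values):

$ dvc config cache.type hardlink
$ dvc checkout --relink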

Concluding Remarks

In this post, we began to actually put some of the MLOps principles into practice by using DVC to efficiently version & manage our data & models. As we have seen, DVC is a powerful & flexible tool, & we have covered only some of the fundamental features that it offers. Although there is a lot more to DVC in terms of creating reproducible ML pipelines, I have chosen not to cover that in this series because, in the upcoming blogs, I will introduce a couple more tools & frameworks for creating & managing end-to-end ML pipelines more effectively. Nevertheless, interested folks can go through the DVC documentation to understand the full capabilities of DVC.

I hope that you found DVC interesting & would encourage you to use it henceforth in your ML & Data Science projects to maintain a clean workspace. Also, feel free to check out the procedure to use Google Drive as a DVC remote storage.

In the next blog, we will shift our attention from versioning to using the data & create end-to-end ML pipelines for performing various kinds of experiments using an interesting & powerful Python library called PyCaret.

Following are the other parts of this Fundamentals of MLOps series:

Thank you & Happy Coding!

If you enjoyed this article, I’m certain that you’d love my brand-new FREE AI Products & Research newsletter, “The Vision, Debugged”.

Subscribe & join the bandwagon of enthusiastic readers across top companies like Microsoft, Google, Walmart, Deloitte & more to get cool AI products & research insights, cheat sheets & resources.

About the Author

Hey folks!

I’m Tezan Sahu, an Applied Scientist at Microsoft, an Amazon #1 Bestselling Author (for the book “Beyond Code: A Practical Guide for Data Scientists, Analysts & Engineers”), and co-author of “The Vision, Debugged” newsletter.

I am passionate about helping aspiring data scientists & software developers kickstart their careers, deliver consistent impact & become differentiated professionals in the field of AI & Data Science.

If you are interested in learning more about how you can leverage AI to stay ahead of the curve and boost your results, connect with me on LinkedIn & subscribe to my newsletter.
