Fundamentals of MLOps — Part 2 | Data & Model Management with DVC
In Part 1 of this 4-blog series, we understood the motivation behind adopting MLOps practices & also looked at some of the basic principles of MLOps. We also had a glance at the arsenal of tools & frameworks used to implement MLOps in various stages of an ML workflow lifecycle. From this blog onwards, we will dive deeper into some of these tools & learn how to actually use them in our own projects for better reproducibility & collaboration.
- Versioning in ML Projects
- What Is Data Version Control (DVC)?
- Exploring the Basic Functionalities of DVC
- Tracking a File using DVC
- Storing Data Remotely
- Retrieving Remotely Stored Data
- Making Modifications to the Data
- Switching between Versions of Data
- Accessing Data Tracked by DVC
- DVC Python API
- Using DVC on a Shared Server
- Concluding Remarks
- Additional References
Versioning in ML Projects
Software development involves continuously documenting & maintaining source code, modifying it & collaborating with other developers; a version control system (like Git or Mercurial) makes this task easier. Similarly, in ML & Data Science, creating & deploying ML models to production is an iterative process requiring extensive versioning for maintenance & reproducibility. Such projects stand in stark contrast to the usual software development projects because ML requires both code & data; thus, versioning here is not restricted to the code, but also involves the versioning of data, hyperparameters & other metadata.
We are all familiar with Git as a well-known & useful version control tool used widely in software development. So, although it is the holy grail of version control, why is it not advisable (or efficient) to use only Git for version control in ML projects?
- ML/DL projects usually involve tons of data, running into 100s of MBs (& sometimes GBs & TBs as well), while GitHub rejects any single file larger than 100MB pushed to a repo. Note that solutions like git-lfs (Git Large File Storage) avoid such issues, but they are not flexible enough.
- During collaboration among people in an organization, it may be required that access to data (or some other portion of the pipeline) is restricted only to a certain set of people. If the data is tracked directly along with other code in a GitHub repo, it may be difficult to provide such varying levels of access.
- Moreover, actual data for training a model may already be present remotely in some storage service (like Amazon S3, Azure Blob Storage, etc.) & it would be wasteful to copy over all that data into a GitHub repo & track it separately.
All this calls for some specific tools & frameworks to address this issue of versioning ML artifacts like data, hyperparameters & models that work along with existing version control tools to provide a seamless experience for ML project management. Enter Data Version Control!
Why Do Data & Models Change?
Before diving into Data Version Control, let us first list some of the reasons why data & models change in an ML system:
- Data can be spread across several platforms
- Data may be frequently changing, requiring periodic updates to our model to perform well (think of click-through rate data for building recommender systems)
- Models can be retrained with this new data & perhaps newer training techniques
- Models may degrade over time
- Underperforming models can be rapidly reverted to an earlier serving version
What Is Data Version Control (DVC)?
Data version control (the concept, not the specific tool!) is a collection of tools and procedures that attempt to adapt the version control process to the world of data. One such tool that helps data scientists govern their data and models and run reproducible experiments is DVC or Data Version Control.
DVC, developed by Iterative.AI, is an open source command-line tool written in Python for data science & ML project management & versioning. To take advantage of the existing toolset that most developers are familiar with, it emulates Git commands and workflows so that they may rapidly integrate it into their usual Git practice.
DVC uses a remote repository (all major cloud providers are supported) to store all the data and models for a project. In the actual code repository, a pointer to this remote location is stored to access the actual artifacts.
Features of DVC
Using DVC brings agility, reproducibility, and collaboration into your existing data science workflow. Some of the core features of DVC are:
- Git-compatible: It runs on top of any Git repository and is compatible with any standard Git server or provider (GitHub, GitLab, etc.)
- Simplistic Data Versioning: Several versions of data & models are maintained by replacing large files, dataset directories, ML models, etc. with small metafiles (as placeholders) containing pointers to the original data
- Storage-agnostic: It can use Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or local disk as a remote to store data
- Reproducible: It makes ML projects reproducible by creating lightweight pipelines using implicit dependency graphs, and by codifying the data and artifacts involved
- Language- & framework-agnostic: It is independent of the programming languages (Python, R, Julia, shell scripts, and so on) or machine learning libraries (Keras, TensorFlow, PyTorch, SciPy, and so on) used in the project
- Low friction branching: It supports instantaneous Git branching, even with large files & also prevents data duplication across experiments
- Easy to use: It is quick to install and doesn’t require special infrastructure
Note: DVC does NOT replace Git! The metafiles that hold references to original data and models in the remote vary with the data during the course of the project & require Git to be version controlled.
Comparing DVC to other Solutions
Apart from DVC, there are several other top tools for data version control that can help automate work and optimize processes. Here’s a quick comparison between DVC & some of the other tools:
Exploring the Basic Functionalities of DVC
With the basic overview of DVC, let’s dive straight into using DVC for our own ML project to understand the various functionalities that it offers. In this tutorial, we will learn how to:
- Install DVC & initialize a project
- Add files to a remote repository (Amazon S3) & retrieve them later
- Switch between different versions
- Search & import files from a DVC repo
Later, we will also look into using DVC as a Python package.
DVC can be installed from its repository or binary package based on the user’s OS, the steps for which can be found in the docs.
However, since DVC can also be used as a Python library (we will explore this later), it can simply be installed using a package manager like pip or conda. Depending on the type of remote storage being used in the project, optional dependencies may have to be installed.
When using such package managers, it is advisable to create & enable a virtual environment before installing the package.
Based on our preferred package manager, we will install DVC (with Amazon S3 remote) as follows:
$ pip install "dvc[s3]" # Using pip
$ conda install -c conda-forge dvc-s3 # Using conda
In this case, it installs the boto3 library along with DVC.
Feel free to check out the DVC documentation to install dependencies for other remotes that you may want to use.
Now, you can type dvc --help to view all the commands that can be used with DVC.
After installing DVC, we create a project directory & initialize it as a Git repo. Create a corresponding GitHub repo & set it as the origin remote for the Git repo. Then, we use dvc init to initialize it as a DVC repo. This is our workspace.
$ mkdir mlops_dvc
$ cd mlops_dvc
$ git init
$ git remote add origin <github-repo-link>
$ git branch -M main
$ dvc init
This will create an internal .dvc folder with the structure shown below, which holds the internal directories and files needed for DVC operation.
.dvc
├── .gitignore
├── config
├── plots
│   ├── confusion.json
│   ├── confusion_normalized.json
│   ├── default.json
│   ├── linear.json
│   ├── scatter.json
│   └── smooth.json
└── tmp
    ├── links
    │   └── cache.db
    └── md5s
        └── cache.db
config: Configuration file, edited using dvc config
plots/: Directory for plot templates
tmp/: Directory for miscellaneous temporary files
links/: SQLite DB storing file links from cache to workspace
md5s/: SQLite database storing MD5 hashes for tracked files
.dvcignore: Files and directories excluded when traversing a DVC project
The .dvc folder thus holds the configuration files, local cache, plot templates & temporary files. It is similar to the .git/ folder created on the initialization of a Git repo. The dvc init command also creates a .dvcignore file (similar to .gitignore) to contain a list of paths for DVC to ignore.
Now, we commit these internal files to the Git repo using git commit -m "Initialize DVC in repo".
From here on, we will create and manage different DVC files and populate the cache as we use DVC and work on our ML project.
Tracking a File using DVC
Once our DVC repo is initialized, we can start adding data & code to our Git repo, as required by our ML project. It helps to maintain a clean directory structure for your ML project that separates the data, code, models, metrics, etc. Following is a very generic directory structure that can be adopted for most ML projects:
├── data/ # Directory with raw and intermediate data
│ └── prepared/ # Processed dataset
├── metrics/ # Plots & logs with performance metrics
├── models/ # Trained models
└── src/ # Files for processing, training, evaluation
The data/ directory may usually hold training data in the form of files (like XML, CSV, TSV, etc.) or folders (like images belonging to various classes), which may be huge. For this tutorial, we will download this toy dataset (data.csv) & place it in the mlops_dvc/data/ folder. Now, type the following command in the terminal to start tracking the data.csv file:
$ dvc add data/data.csv # To start tracking a file or directory
With this, DVC saves information about the newly added files (or directories) in a special .dvc file named data/data.csv.dvc (a tiny text file in a readable format). This metadata file serves as a placeholder for the actual data and can be readily versioned using Git, much like source code. Its contents are as follows:
- md5: 97edcee3524b1675ab68790ad35e80e9
In the background, dvc add moves the data to the project's cache, and links it back to our workspace. You can verify that the .dvc/cache/ folder contains a subfolder 97, having a file named edcee3524b1675ab68790ad35e80e9. The MD5 hash value of the data.csv file we just added (97edcee...) determines this cache path.
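To make the hash-to-path mapping concrete, here is a small Python sketch (not DVC's own code) that hashes a file with MD5 and derives the cache location the way the paragraph above describes. The helper names file_md5 and cache_path_for are made up for illustration, and the .dvc/cache/<first 2 characters>/<remaining characters> layout is simply the one shown in this post; other DVC versions may arrange the cache differently, so the result is not guaranteed to match your installation.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(md5_hex, cache_dir=".dvc/cache"):
    """Assumed layout: a 2-character folder followed by a 30-character file name."""
    return Path(cache_dir) / md5_hex[:2] / md5_hex[2:]


# The hash recorded in data/data.csv.dvc in this tutorial
print(cache_path_for("97edcee3524b1675ab68790ad35e80e9").as_posix())
```

Calling cache_path_for(file_md5("data/data.csv")) on your own copy of the dataset should then point at the file that dvc add placed in the cache, under the layout assumption stated above.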
Now, we track the data.csv.dvc file using Git, while ignoring the actual data.csv file:
$ git add data/data.csv.dvc data/.gitignore # Stage the pointer file & the .gitignore (created by dvc add) that keeps data/data.csv out of Git
$ git commit -m "Add raw data" # Commit data/data.csv.dvc for Git tracking
Storing Data Remotely
Once data is tracked using DVC, it can be safely stored remotely (DVC supports several remote storage options, as discussed previously).
Setting up the Amazon S3 Bucket
Since we plan to store the data in Amazon S3, we will first have to set up our S3 bucket that will store our data as follows:
Step 1: Log into your AWS Console
Step 2: Search for “s3” in the search box & select the S3 service
Step 3: Click on “Create Bucket” in the S3 Dashboard
Step 4: Enter a Bucket Name & select the “Asia Pacific (Mumbai) ap-south-1” AWS Region. For example, I have set the bucket name as mlopsdvc170100035 (S3 bucket names must be unique within a partition, i.e., a group of regions, so you will need to pick your own). Leave all other settings at their defaults & hit the "Create Bucket" button.
Step 5: Now we need to create our Access Key that will enable DVC (on our local system) to communicate with the Bucket. For this, expand the menu under your profile name & click “My Security Credentials”.
Step 6: Click the “Access keys (access key ID and secret access key)” & then the “Create New Access Key” button. This will create a unique access key pair for you.
You should download the key file & keep it safe because this is the only time that you will be shown your secret access key.
Step 7: Now, create 2 environment variables named AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY containing your access key ID & secret access key respectively, which will be picked up by the boto3 library when we try to upload/download data from our S3 bucket. Also, create & set the AWS_DEFAULT_REGION variable to ap-south-1.
Reference for Windows Users
Reference for Linux Users
With this, we have set up our AWS Credentials & Amazon S3 bucket to store our data remotely using DVC.
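Before moving on, it can be handy to confirm that the variables from Step 7 are actually visible to Python, since that is where boto3 will look for them. The following is a minimal sketch; the function name missing_aws_vars is hypothetical, and it only checks that the variables exist, not that the credentials are valid.

```python
import os

# Standard environment variable names read by boto3
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All AWS environment variables are set.")
```

Run it in the same terminal session in which you plan to use DVC, because variables set in one shell are not automatically available in another.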
Pushing Data to Remote Storage
To store the DVC-tracked data & model files remotely, we first need to configure the remote storage location as follows:
$ dvc remote add -d storage s3://mlopsdvc170100035/dvcstore # Enter your respective bucket name
$ git add .dvc/config
$ git commit -m "Configure remote storage"
The first command creates a remote named storage & links it to a (newly created) folder dvcstore in our S3 bucket. The -d flag ensures that this remote is the default one used for uploading & downloading data.
Next, we push our data into the remote storage (dvc push copies the data cached locally to the remote storage) & later push to our GitHub repo:
$ dvc push
$ git push origin main
After completing the above steps, you can check your Amazon S3 console to find that the data has been pushed into your bucket.
On inspecting your GitHub repo, you should notice that the data/ folder contains only the .gitignore & data.csv.dvc files, whereas the actual data.csv file isn't stored on GitHub.
Retrieving Remotely Stored Data
DVC-tracked data & models may be retrieved and used in other copies of this project using dvc pull as needed. It is often executed after git clone & git pull.
To simulate this situation in this tutorial, the .dvc/cache/ folder & data/data.csv file need to be deleted, following which we pull from the DVC remote.
$ rm -rf .dvc/cache
$ rm -f data/data.csv
$ dvc pull
We can easily verify that the data.csv file (of ~150MB) has been downloaded from our remote to our workspace.
Making Modifications to the Data
In real-world ML projects, we may use multiple versions of the data & models across the various experiments. Some potential causes of this may be:
- Different pre-processing techniques being used in different experiments
- The inflow of new data into the system from time to time from some external source
- Newer models trained with different hyperparameters
Under such circumstances, DVC helps track these changes effectively. To simulate such a situation, we double our original dataset as follows:
# Copy the contents of data.csv to a temporary location
$ cp data/data.csv /tmp/data.csv
# Append contents from this location to the end of data.csv
$ cat /tmp/data.csv >> data/data.csv
We can verify this modification by noting that the size of this new data.csv file is twice that of the old file. Now, we are ready to track this change & upload the new file to our S3 bucket:
$ dvc add data/data.csv
$ git add data/data.csv.dvc
$ git commit -m "Update dataset"
$ dvc push
As expected, pushing this new file takes almost twice as long as the previous push due to the increase in the size of the file.
You can verify the push by checking that the MD5 hash in the updated data/data.csv.dvc file (7663219...) corresponds to the file path in the S3 bucket.
Now you can push to GitHub as well to track the new changes.
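The whole modification step can be mimicked in a few lines of Python to see why DVC treats the doubled file as a new version: the content changes, so its MD5 hash changes, and the hash in turn decides where the object lives in the remote. This is only a sketch on a tiny stand-in file; the helper names md5_of and remote_object_url are made up, and the <remote>/<first 2 characters>/<remaining characters> layout is an assumption based on the description above, so your bucket may be organised differently depending on the DVC version.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Hex MD5 digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def remote_object_url(remote_url, md5_hex):
    """Assumed layout: <remote>/<first 2 hash chars>/<remaining hash chars>."""
    return f"{remote_url.rstrip('/')}/{md5_hex[:2]}/{md5_hex[2:]}"


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_text("id,value\n1,10\n2,20\n")  # stand-in for the toy dataset
    old_size, old_md5 = data.stat().st_size, md5_of(data)

    # Same effect as: cat /tmp/data.csv >> data/data.csv
    data.write_bytes(data.read_bytes() * 2)
    new_size, new_md5 = data.stat().st_size, md5_of(data)

print(new_size == 2 * old_size)  # True: the file doubled in size
print(new_md5 != old_md5)        # True: new content, hence a new hash
print(remote_object_url("s3://<your-bucket>/dvcstore", new_md5))
```

The last line prints where the new object would be expected in the bucket under the assumed layout, which is the same check the post suggests doing by hand in the S3 console.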
Switching between Versions of Data
Once we have several versions of our data & models, we will naturally need to switch between these versions time & again. Just as Git has the checkout command to switch between different versions of our code, DVC offers its own checkout command to synchronize the data & models according to the .dvc files. The usual practice is to run git checkout <...> (to switch a branch or checkout a .dvc file version), followed by dvc checkout.
To get to the previous version of our dataset & commit, we perform the following:
$ git checkout HEAD~1 data/data.csv.dvc
$ dvc checkout
$ git commit data/data.csv.dvc -m "Revert updates"
Note that we don’t need to dvc push again because this version of the dataset is already stored in our Amazon S3 bucket. Thus, we can say that DVC isn't even a version control system in the strictest sense! Data file versions are defined by the contents of the .dvc file, while version control is provided through Git. DVC then produces these .dvc files, changes them, and efficiently synchronizes the DVC-tracked data in the workspace to match them.
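The division of labour described above (Git versions a small pointer, while the storage keeps every version of the content under its hash) can be illustrated with a toy model in Python. This is purely a conceptual sketch and not how DVC is implemented: the dictionaries cache and workspace and the functions add and checkout are stand-ins invented for this example.

```python
import hashlib

cache = {}      # plays the role of the DVC cache/remote: md5 -> file content
workspace = {}  # plays the role of the working directory: path -> content


def add(path):
    """Like `dvc add`: store the content under its hash and return a pointer."""
    content = workspace[path]
    md5 = hashlib.md5(content).hexdigest()
    cache[md5] = content
    return {"path": path, "md5": md5}  # the small part that Git would version


def checkout(pointer):
    """Like `dvc checkout`: make the workspace match the given pointer."""
    workspace[pointer["path"]] = cache[pointer["md5"]]


workspace["data.csv"] = b"a,b\n1,2\n"
v1 = add("data.csv")  # pointer from the "Add raw data" commit

workspace["data.csv"] = workspace["data.csv"] * 2
v2 = add("data.csv")  # pointer from the "Update dataset" commit

checkout(v1)  # restore the older pointer, as git checkout + dvc checkout do
print(workspace["data.csv"] == b"a,b\n1,2\n")  # True
print(len(cache))                               # 2: both versions stay stored
```

Because the old content never left the store, going back to v1 needs no new upload, which mirrors why no further dvc push is required after reverting.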
Accessing Data Tracked by DVC
Once we can track our data & models effectively, the very next thought that arises is how can we access these tracked artifacts to reuse the datasets or deploy a model? In this section, we will discuss the way we can look for relevant files in a DVC-tracked repository & download data from it.
Searching for Files/Directories using DVC
To download data & models, we first need to find where they are located & if they are accessible. To explore a DVC repository hosted on any Git server, the command dvc list can be used. It lists the project contents, including files, models, and directories tracked by both DVC & Git. The syntax for this command is as follows:
dvc list <repo-url> [<file-path>] ([...] implies that the <file-path> is optional)
As an example, dvc list can be run (from some other directory) on the GitHub repo that you have been using for this tutorial as follows:
$ dvc list https://github.com/<your-username>/<repo-name>
$ dvc list https://github.com/<your-username>/<repo-name> data
Downloading Relevant Files & Directories
dvc get & dvc import can be used to download files or directories from a DVC-tracked repository. However, dvc get downloads the files without maintaining any of the metadata that connects them to the project, whereas dvc import also creates the corresponding .dvc files to be saved in the project.
dvc import is essentially a combination of dvc get & dvc add applied together.
The syntax of these commands is as follows:
dvc get/import <repo-url> <file-path> -o <output-file-path>
The data.csv file stored in our S3 bucket & tracked by the data/data.csv.dvc file can be downloaded into any project as follows:
$ dvc get https://github.com/<your-username>/<repo-name> data/data.csv -o data_downloaded/data.csv
$ dvc import https://github.com/<your-username>/<repo-name> data/data.csv -o data_downloaded/data.csv
dvc get can be called from anywhere, but dvc import must be called from a DVC-tracked repository.
DVC Python API
As mentioned previously, DVC can be used as a Python library as well by importing it like any other Python module using import dvc.api. The goal of this API is to give developers some programmatic access to the data or models versioned in DVC-tracked repositories. Two of these library functions that we will go through are dvc.api.get_url() & dvc.api.open().
dvc.api.get_url(): Given the DVC-tracked GitHub repository URL (repo) & path to file (path), it returns the URL string of the storage location where the actual file is stored in the DVC remote.
import dvc.api
x = dvc.api.get_url(repo="https://github.com/<your-username>/<repo-name>", path="data/data.csv")
# x stores the URL of the corresponding file in your S3 bucket
# (like s3://mlopsdvc<your-roll-number>/dvcstore/<path-to-file>)
dvc.api.open(): Given the DVC-tracked GitHub repository URL (repo), the path to file (path) & mode in which the file is opened (mode), it opens the data/model file & generates the corresponding file object.
import dvc.api
import pandas
with dvc.api.open(repo="https://github.com/<your-username>/<repo-name>", path="data/data.csv", mode="r") as fd:
    df = pandas.read_csv(fd)
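Since the earlier sections were all about switching between versions, it is worth noting that dvc.api.open() also accepts a rev argument naming a Git revision, so a specific version of the data can be read without touching the workspace. The wrapper below is a hedged sketch: read_csv_rows is a hypothetical helper, it uses the standard csv module instead of pandas to stay dependency-light, and the repository URL and revision in the usage comment are placeholders to be replaced with your own.

```python
import csv


def read_csv_rows(path, repo, rev=None):
    """Read a DVC-tracked CSV at a given Git revision into a list of dicts."""
    import dvc.api  # imported lazily so this module loads even without DVC

    with dvc.api.open(path, repo=repo, rev=rev, mode="r") as fd:
        return list(csv.DictReader(fd))


# Hypothetical usage; substitute your own repository URL and revision:
# rows = read_csv_rows(
#     "data/data.csv",
#     repo="https://github.com/<your-username>/<repo-name>",
#     rev="<commit-hash-or-tag>",
# )
```

Leaving rev as None reads the version referenced by the default branch of the repository, which matches the behaviour of the plain dvc.api.open() call shown earlier.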
Using DVC on a Shared Server
Usually, individual laptops are not used for computationally intensive work in many academic and professional contexts, since they are not powerful enough to handle huge volumes of data or intense processing. Teams instead rely on the cloud or on-premises workstations, where multiple people frequently collaborate on the same development server. Although this allows better resource utilization such as GPU access & centralized data storage, there is a high chance of unnecessary duplication of the same data across several users & repositories if it is not managed correctly.
Given this setting, DVC can play a crucial role in maintaining a shared cache on the development server such that all users can have near-instantaneous workspace restoration and switching speeds simply using the dvc checkout command.
After a DVC repository is initialized with dvc init, by default DVC puts the cache in the repository’s .dvc/cache folder. To create & use a shared cache on the server, a directory for the external cache must be created & configured with appropriate permissions & linking.
Given below is a step-by-step example of setting up a shared DVC cache on a server & using it in a DVC-tracked repository:
# Create a folder to behave as external cache
$ mkdir -p /home/shared/dvc-cache
# Move cache from current repo to external cache (optional)
$ mv .dvc/cache/* /home/shared/dvc-cache
# Set cache of DVC-tracked repository as folder made above
$ dvc cache dir /home/shared/dvc-cache
# Set group permissions on new cache files
$ dvc config cache.shared group
# Enable symlinks to avoid copies from cache to workspace
$ dvc config cache.type symlink
On inspecting the .dvc/config file in the repository, the following section appears:
[cache]
dir = /home/shared/dvc-cache
shared = group
type = symlink
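If you want to check these settings from a script, the cache section can be read back with Python's built-in configparser, as sketched below. Two assumptions to keep in mind: the config text is embedded as a string here to keep the example self-contained, and configparser is used only because this particular section is plain INI; DVC itself may parse its config differently, and other sections of a real .dvc/config may not be handled as cleanly.

```python
import configparser

# Stand-in for the cache section of .dvc/config after the commands above
config_text = """
[cache]
dir = /home/shared/dvc-cache
shared = group
type = symlink
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

# Collect the cache settings into a plain dictionary
cache_settings = dict(parser["cache"])
print(cache_settings)
```

To inspect an actual repository, parser.read(".dvc/config") can be used in place of read_string, subject to the caveat above.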
Besides symlink, cache.type could be reflink, hardlink or copy. Once any change is made to cache.type, DVC needs to be told explicitly about it using dvc checkout --relink before the change takes effect.
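To see why the symlink cache type avoids duplicating data, the following sketch recreates the idea with plain Python: one real file sits in a stand-in cache folder, and the workspace entry is merely a link to it. The directory names are invented for the example and DVC is not involved at all; note also that creating symlinks may need extra privileges on Windows.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the shared cache and a user's workspace
    cache_file = Path(tmp) / "dvc-cache" / "97" / "edcee3524b1675ab68790ad35e80e9"
    workspace_file = Path(tmp) / "repo" / "data" / "data.csv"
    cache_file.parent.mkdir(parents=True)
    workspace_file.parent.mkdir(parents=True)

    cache_file.write_text("id,value\n1,10\n")  # the only real copy of the data
    os.symlink(cache_file, workspace_file)      # the workspace entry is a link

    is_link = workspace_file.is_symlink()
    same_file = os.path.samefile(cache_file, workspace_file)
    content = workspace_file.read_text()

print(is_link, same_file)
```

Every user whose workspace links into the same cache reads the same bytes on disk, which is what makes checkouts on a shared server fast and space-efficient.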
Concluding Remarks
In this post, we began to actually put some of the MLOps principles into practice by using DVC to efficiently version & manage our data & models. As we have seen, DVC is a powerful & flexible tool, & we have just covered some of the fundamental features that it offers. Although there is a lot more to DVC in terms of creating reproducible ML pipelines, I have chosen not to cover those in this series because, in the upcoming blogs, I will introduce a couple more tools & frameworks for creating & managing ML pipelines end-to-end more effectively. Nevertheless, interested folks can feel free to go through the DVC documentation to understand the capabilities of DVC.
I hope that you found DVC interesting & would encourage you to use it henceforth in your ML & Data Science projects to maintain a clean workspace. Also, feel free to check out the procedure to use Google Drive as a DVC remote storage.
In the next blog, we will shift our attention from versioning the data to using it, creating end-to-end ML pipelines for performing various kinds of experiments using an interesting & powerful Python library called PyCaret.
Following are the other parts of this Fundamentals of MLOps series:
- Part 1: A Gentle Introduction to MLOps
- Part 3: ML Experimentation using PyCaret
- Part 4: Model Tracking with MLFlow & Deployment with FastAPI
Thank you & Happy Coding!
About the Author
Hey folks! I’m Tezan Sahu, a Data & Applied Scientist at Microsoft. I completed my B.Tech from IIT Bombay with a Major degree in Mechanical Engineering and a Minor degree in Computer Science & Engineering. I have a keen interest in NLP, Deep Learning & Blockchain, & love to develop interesting products using cutting-edge technologies.