Data Version Control with DVC and Git
This article introduces DVC and provides a brief guide to getting started with it.
What is DVC?
Designed to handle large files, datasets, machine learning models, metrics, and your code, DVC is an open source project that makes ML models reproducible and shareable.
The philosophy behind DVC tools is to bring best practices of software development into Machine Learning.
Code and data are linked so that we don’t need to worry about tracking them separately every time we run an experiment.
How does data versioning work?
Source code can be versioned using Git. For collaboration, we typically use GitHub, GitLab, or a self-hosted Git server, in addition to a local copy. Local and remote files can be synced using the git push and git pull commands.
While Git works well with small files, it struggles with larger files (such as datasets or models). The DVC system is built on top of Git and manages data similarly.
In DVC, metadata about different file versions is stored in the Git history. In this way, Git commits are linked to our data and models.
When we tell DVC to track large files or directories with dvc add, the data is moved to a DVC cache (our local copy). Git is much more friendly to small files than large ones, so DVC creates a tiny metadata file that can be committed instead of the actual data files.
Metadata files contain information about where data is stored. Typically, this will be a remote storage location, such as AWS S3, Google Cloud, Azure, or an SFTP server.
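To make this concrete, here is a small Python sketch that pulls the fields out of a metafile like the one shown later in this article. The content is hard-coded for illustration, and the toy line-by-line parsing is only for demonstration; real .dvc files are YAML and are best read with a proper YAML parser.

```python
# Sketch: reading the fields of a .dvc metadata file.
# The content mirrors the simple layout DVC writes for a single output.
metafile = """outs:
- md5: daff69620dbb16d76b1117013254f7aa
  size: 27
  path: file.txt
"""

fields = {}
for line in metafile.splitlines():
    line = line.strip().lstrip("- ")
    if ": " in line:
        key, value = line.split(": ", 1)
        fields[key] = value

print(fields["md5"])   # -> daff69620dbb16d76b1117013254f7aa
print(fields["path"])  # -> file.txt
```

The md5 field identifies the data in the cache or remote, while the path field tells DVC where the file should appear in the workspace.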
What happens under the hood?
We can run dvc add in verbose mode to get a little more insight into how DVC does its data versioning:
dvc add -v file.txt
Computed stage: 'file.txt.dvc' md5: 'None'
Preparing to transfer data from 'memory://dvc-staging/85207a32b6773e0046773419df47e57753028701c9ec0b10d8ca782f0c8032a3' to '/home/coding/dvc_repo/.dvc/cache'
Preparing to collect status from '/home/coding/dvc_repo/.dvc/cache'
Collecting status from '/home/coding/dvc_repo/.dvc/cache'
Preparing to collect status from 'memory://dvc-staging/85207a32b6773e0046773419df47e57753028701c9ec0b10d8ca782f0c8032a3'
Removing '/home/coding/dvc_repo/.m7XMCEPHPZ8siiCP9atGzJ.tmp'
Removing '/home/coding/dvc_repo/.dvc/cache/.Bih4nnvvbQa6RfSnrTFJnC.tmp'
Removing '/home/coding/dvc_repo/file.txt'
Saving information to 'file.txt.dvc'.
Initially, DVC checks if a metafile has already been created for the file we added. As md5: ‘None’ shows, that’s not the case here.
Afterwards, DVC saves the data to the DVC cache, and then removes the original. As a final step, all relevant information is saved to the DVC metafile.
outs:
- md5: daff69620dbb16d76b1117013254f7aa
  size: 27
  path: file.txt
Metafiles contain MD5 hashes of files and their locations.
Every time DVC is instructed to retrieve file.txt in the current directory, it will fetch the file with hash daff69620dbb16d76b1117013254f7aa.
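As an illustration, here is a Python sketch of how a content hash maps to a cache location. The layout shown (first two hex characters of the MD5 digest as a subdirectory, the rest as the file name) matches DVC's classic cache structure; treat the exact path as an assumption, since newer DVC versions nest entries under files/md5/.

```python
import hashlib
from pathlib import Path

def cache_path_for(content: bytes, cache_dir: str = ".dvc/cache") -> Path:
    """Classic DVC-style cache location for some file content:
    the MD5 hex digest split as <first 2 chars>/<remaining 30 chars>."""
    digest = hashlib.md5(content).hexdigest()
    return Path(cache_dir) / digest[:2] / digest[2:]

# Identical content always maps to the same cache entry, which is how
# DVC deduplicates data: each unique content hash is stored only once.
p = cache_path_for(b"hello dvc\n")
print(p)  # .dvc/cache/<2 hex chars>/<30 hex chars>
```

Because the path is derived purely from the content, restoring a file is just a lookup: hash from the metafile in, cache path out.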
Since the metafile is versioned by Git, DVC ties data to specific code versions.
A change to our dataset would constitute an updated DVC metafile and thus a new commit in our Git history.
So far, the DVC cache we have been using is located on our local machine. Typically, however, we would also want to version our data on remote storage.
The remote is where DVC sends our data for persistent storage. It is analogous to a remote Git repository (e.g., GitHub or GitLab).
DVC supports a wide variety of remote storages, including Amazon S3, Google Cloud Storage, Azure Blob Storage and others.
Versioning Example
Now let's walk through how to version data with DVC and Git.
First, we create and activate a conda environment:
$ conda create --name dvc python=3.8.2 -y
$ conda activate dvc
After creating a folder for the project, the first thing to do is to initialize Git and DVC:
$ mkdir dvc_dir && cd dvc_dir
$ git init
$ dvc init
We create a data folder and use the dvc get command to download a dataset into it:
$ mkdir data
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
Inside our data directory, we now have a new file called data.xml:
$ ls -lh data/
>> -rw-rw-r-- 1 coding coding 14M Nov 11 18:47 data.xml
Now that we’ve got the data, we’re going to add it to DVC tracking.
$ dvc add data/data.xml
>> To track the changes with git, run:
>> git add data/.gitignore data/data.xml.dvc
>> To enable auto staging, run:
>> dvc config core.autostage true
Let’s add the two new files that were generated when we ran dvc add, as DVC suggests above, and commit them.
$ git add data/.gitignore data/data.xml.dvc
$ git commit -m "Add raw data"
$ ls data/
>> data.xml data.xml.dvc
Let’s check the data.xml.dvc file inside:
$ cat data/data.xml.dvc
>> outs:
>> - md5: 22a1a2931c8370d3aeedd7183606fd7f
>> size: 14445097
>> path: data.xml
This is a standard .dvc file with a single output (the outs field). The hash value (md5 field) determines the file's path in the DVC cache.
At this point we've got a dataset that's tracked by DVC, but it only lives on our machine. We want to push it to remote storage, so we first add a DVC remote. We'll use Google Drive, one of the simplest options to set up:
$ dvc remote add -d storage gdrive://gdrive_id_folder
The information about the storage is located in the DVC config file.
$ cat .dvc/config
>>[core]
>> remote = storage
>>['remote "storage"']
>> url = gdrive://gdrive_id_folder
Now we commit the DVC config file to record the remote we've added, and then push the data to the DVC remote storage:
$ git commit .dvc/config -m "Configure remote storage"
$ dvc push
As we can see, the file has been uploaded to the DVC remote storage.
Let's try to pull the data back from the remote. Before doing this, we remove the local copies so they aren't simply reused:
$ rm -f data/data.xml && rm -rf .dvc/cache/
$ dvc pull
You might encounter the following error:
<HttpError 403 when requesting *www.googleapis.com link* returned "This file has been identified as malware or spam and cannot be downloaded". Details: "[{'domain': 'global', 'reason': 'abuse', 'message': 'This file has been identified as malware or spam and cannot be downloaded'}]">
In that case, run the following (note that our remote is named storage):
$ dvc remote modify storage gdrive_acknowledge_abuse true
Now let's change our dataset, and then use DVC and Git to move backward and forward in time between different versions of it.
We will make an artificial change in our data, appending a copy of the data to the originals.
$ cp data/data.xml /tmp/data.xml
$ cat /tmp/data.xml >> data/data.xml
As we can see, the file size has doubled:
$ ls -lh data/
>> total 28M
>> -rw-rw-r-- 1 coding coding 28M Nov 11 19:55 data.xml
>> -rw-rw-r-- 1 coding coding 80 Nov 11 18:49 data.xml.dvc
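Conceptually, appending a copy gives the dataset new content and therefore a new content hash, which is why DVC will record this as a new version. A quick Python sketch, with made-up content standing in for data.xml:

```python
import hashlib

original = b"<row>some data</row>\n"  # stands in for the original data.xml
doubled = original + original         # appending a copy, as we did above

# The doubled file has a different MD5, so DVC writes an updated metafile
# and stores a second object in the cache/remote alongside the first.
h1 = hashlib.md5(original).hexdigest()
h2 = hashlib.md5(doubled).hexdigest()
print(h1 != h2)                           # True: a new dataset version
print(len(doubled) == 2 * len(original))  # True: the size has doubled
```

Both versions remain retrievable, because each hash keeps pointing at its own stored object.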
After changing the data, we run dvc add again so that the .dvc file is updated. To record the change, we then commit it and push the latest version of the dataset to cloud storage:
$ dvc add data/data.xml
$ git add data/data.xml.dvc
$ git commit -m "Dataset updates"
$ dvc push
We can confirm this by looking in our remote storage, where a new folder has appeared. We now have two different versions of our dataset in cloud storage.
If we look at our Git commit log, we can see the commits corresponding to the different versions of the dataset:
$ git log --oneline
>> cd66481 (HEAD -> master) Dataset updates
>> 904541f Configure remote storage
>> 1e580dc Add raw data
Now let's switch between versions.
To do this, we can git checkout a previous version of the .dvc file:
$ git checkout HEAD^1 data/data.xml.dvc
>> Updated 1 path from b46792c
After we've done that, we run dvc checkout to sync the data with the restored .dvc file:
$ dvc checkout
Let's check: we are back to the 14M version!
$ ls -lh data/
>> total 14M
>> -rw-rw-r-- 1 coding coding 14M Nov 11 20:04 data.xml
>> -rw-rw-r-- 1 coding coding 80 Nov 11 20:03 data.xml.dvc
This time, we only need to git commit: we don't run dvc add again, since this version of the dataset is already saved in DVC.
$ git commit data/data.xml.dvc -m "Revert dataset updates"
>> [master 63b1ea1] Revert dataset updates
>> 1 file changed, 2 insertions(+), 2 deletions(-)
Finally, we create a GitHub repo to see how this project appears there.
$ git remote add origin repo_url
$ git branch -M master
$ git push -u origin master
The data folder contains our .dvc file, inside which we find the hash that points to the dataset located in cloud storage.
Our DVC config file tells us where our storage is but it doesn’t contain the dataset itself.
Conclusion
While DVC isn't technically a version control system on its own (Git does the version control, and DVC extends it to files that should stay outside of Git), it helps data scientists solve problems they have faced for years!
It allows you to version data and models for each run, and to collaborate with team members without worrying about losing data or running out of disk space.
All of this is done through a handful of commands that, if you have already used Git, will feel immediately familiar.