Published in featurepreneur

What is DVC (Data Version Control) and How to Get Started?

Using a version control system such as Git for source code is good practice and an industry standard. But tracking large data files in Git quickly becomes impractical. Data-intensive projects, such as deep learning projects, rely heavily on the quality of the dataset to produce results. So what’s the solution? Is there an approach for versioning data?

Yes. The Data Version Control (DVC) project aims to bring Git-style versioning to projects that use a lot of data. Such projects often include just a link of some sort to download the data, or sometimes use Git LFS. DVC is a command-line tool written in Python. It mimics Git commands and workflows so that users can quickly incorporate it into their regular Git practice. The steps below walk through a typical DVC workflow.

Getting Started with DVC:

  • Create a virtual environment, activate it, install DVC, and clone the Git repository you want to version.
conda create --name dvc python=3.8.2 -y
conda activate dvc
pip install dvc
  • Then create a Git branch for each version of the data.
git checkout -b "V1.0"
  • Now initialize DVC with the command below; it creates a .dvc folder in your repository. By default, DVC collects anonymized usage analytics, which you can turn off by setting core.analytics to false.
dvc init
dvc config core.analytics false
  • DVC needs remote storage for large files. It can be Google Drive, an S3 bucket, and so on. Here we use local storage: PATH is the path of the remote storage folder on your computer.
dvc remote add -d remote_storage PATH
  • Add the data folder (or file) to DVC tracking. Here PATH is the path of the data to track, not the remote storage path.
dvc add PATH
  • Now your data is stored in the cache and a small .dvc file is created next to the original, e.g. Train.csv -> Train.csv.dvc. DVC also adds the data file to .gitignore so that Git tracks only the .dvc file. After each dvc add, stage the changes in Git and commit them.
git add --all
git commit -m "First commit"
  • Then push the cached data to the DVC remote, and push the new branch to GitHub, setting it as the upstream. Replace BRANCH_NAME with the name of your branch.
dvc push
git push --set-upstream origin BRANCH_NAME
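  • To retrieve the data in another setup, use the following commands. dvc fetch downloads the data from the remote into the cache, and dvc checkout places it in the workspace (dvc pull does both in one step).
dvc fetch
dvc checkout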
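To make the "data goes to the cache, a small pointer file goes to Git" idea from the steps above more concrete, here is a rough sketch using only standard shell tools. It is not DVC's actual implementation or file format: the demo_cache folder, the Train.csv.pointer file and its two fields are hypothetical names chosen for illustration, and md5sum from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Simplified illustration of content-addressed storage; NOT DVC's real code.
set -eu

workdir=$(mktemp -d)            # scratch directory for the demo
cache="$workdir/demo_cache"     # hypothetical stand-in for a cache folder
mkdir -p "$cache"

# A small stand-in for a large data file.
printf 'id,label\n1,cat\n2,dog\n' > "$workdir/Train.csv"

# 1. Hash the file content (assumes md5sum from GNU coreutils).
hash=$(md5sum "$workdir/Train.csv" | cut -d ' ' -f 1)

# 2. Store a copy in the cache under a name derived from the hash.
cp "$workdir/Train.csv" "$cache/$hash"

# 3. Write a tiny pointer file; only this file would be committed to Git.
printf 'md5: %s\npath: Train.csv\n' "$hash" > "$workdir/Train.csv.pointer"

# 4. Simulate a fresh checkout: delete the data, then restore it from the
#    cache using nothing but the hash recorded in the pointer file.
rm "$workdir/Train.csv"
recorded=$(sed -n 's/^md5: //p' "$workdir/Train.csv.pointer")
cp "$cache/$recorded" "$workdir/Train.csv"

restored_hash=$(md5sum "$workdir/Train.csv" | cut -d ' ' -f 1)
echo "pointer hash:  $recorded"
echo "restored hash: $restored_hash"
```

The two printed hashes should match, which is the whole point: Git only needs to version the few bytes of the pointer file, while the bulky data can live in a cache or remote and still be recovered exactly.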

Conclusion:

Ta-da! You have just versioned your data with a few commands. DVC is also useful for reproducible pipelines and reusability. For more details, refer to dvc.org. You and your teammates can track changes to your data in Git with simple commands like the ones above while storing the data itself in your favorite storage service. Check out my repo on dvc here.
