May ’19 DVC❤️Heartbeat

Svetlana Grinchenko
Data Version Control
8 min read · May 21, 2019

Every month we share our news, findings, interesting reads, community takeaways, and everything else along the way.

Some of those are related to our brainchild DVC and its journey. The others are a collection of exciting stories and ideas centered around ML best practices and workflow.

Kudos to StickerMule.com for our amazing stickers (and great customer service)!

News and links

This section of DVC Heartbeat grows with every new issue, and that in itself is good news!

One of the most exciting things we want to share this month is the acceptance of DVC into the Google Season of Docs. It is a new and unique program sponsored by Google that pairs technical writers with open source projects to collaborate on and improve open source documentation. You can find the outline of the DVC vision and project ideas in this dedicated blogpost and check the full list of participating open source organisations. Technically the program starts in a few months, but there is already a fantastic increase in the number of commits and contributors, and we absolutely love it!

The other important milestone for us was the first offline meeting of our distributed remote team. Working side by side and having non-Zoom meetings was amazing. Joining forces to prepare for the upcoming conferences turned out to be the most valuable, educational, and uniting experience for the whole team.

It’s a shame that our tech lead was unable to join us due to another visa denial. We do hope he will finally make it to the USA for the next big conference.

While we were busy finalizing all the PyCon 2019 prep, our own Dmitry Petrov flew to New York to speak at the O’Reilly AI Conference about open source tools for versioning machine learning models and datasets. Unfortunately, the video is available to registered users only (with a free trial option), but you can have a look at Dmitry’s slides here.

We renamed our Twitter! Our old handle was a bit misleading, so we moved from @Iterativeai to @DVCorg (though we are keeping the old one for future projects).

Our team is so happy every time we discover an article featuring DVC or addressing one of the burning ML issues we are trying to solve. Here are some of our favorite links from the past month:

Version Control For Your Machine Learning Projects — Episode 206 by Tobias Macey

Version control has become table stakes for any software team, but for machine learning projects there has been no good answer for tracking all of the data that goes into building and training models, and the output of the models themselves. To address that need, Dmitry Petrov built the Data Version Control project known as DVC. In this episode he explains how it simplifies communication between data scientists, reduces duplicated effort, and simplifies concerns around reproducing and rebuilding models at different stages of the project’s lifecycle.

Here is an article by Favio Vázquez with a transcript of this podcast episode.

Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis, followed up by a comprehensive discussion on Reddit.

With Git-LFS your team has better control over the data, because it is now version controlled. Does that mean the problem is solved? Earlier we said the “key issue is the training data”, but that was a lie. Sort of. Yes keeping the data under version control is a big improvement. But is the lack of version control of the data files the entire problem? No.

Discord gems

There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.

We sift through the issues and discussions and share the most interesting takeaways with you.

We feared that too, until we met them in person. They turned out to be real (unless bots also love ramen now)!

Every time you run dvc add to start tracking a data artifact, its path is automatically added to the .gitignore file. As a result, it is hard to commit it to Git by mistake — you would need to explicitly modify the .gitignore first. The feature to track external data is called external outputs (if all you need is to track some data artifacts). Usually it is used when you have data on S3 or SSH and don’t want to pull it into your workspace, but it also works when your data is located on the same machine outside of the repository.
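A minimal sketch of the .gitignore behavior (the file name is hypothetical, and the exact .gitignore entry may differ between DVC versions):

$ dvc add data/images.zip
$ cat data/.gitignore
/images.zip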

Use dvc import to track and download the remote data the first time, and to re-download it on dvc repro whenever the data has changed remotely. If you don’t want to track remote changes (i.e. lock the data after it has been downloaded), use dvc run with a dummy dependency (any text file that you never touch will do) that runs an actual wget/curl to get the data. See the sketch below.
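A quick sketch of both approaches, assuming the dvc import syntax of that DVC generation that takes a URL and a destination (the URL and file names are hypothetical):

$ dvc import https://example.com/data/dataset.csv data/dataset.csv

$ echo 'manual trigger' > trigger.txt
$ dvc run -d trigger.txt -o data/dataset.csv \
      'wget -O data/dataset.csv https://example.com/data/dataset.csv'

The first command re-downloads on dvc repro whenever the remote file changes; the second one only re-runs if you modify trigger.txt.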

Almost any command in DVC that deals with pipelines (a set of DVC-files) accepts a single stage as a target, for example dvc pipeline show --ascii model.dvc.

It’s a well known problem with NFS and CIFS (Azure) — they do not support file locks properly, which the SQLite engine requires to operate. The easiest workaround is to not create a DVC project on a network-attached partition. In certain cases a fix can be made by changing the mounting options; check this discussion for the Azure ML Service.
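If you cannot avoid a CIFS mount, one workaround that has been reported for the SQLite locking issue is disabling byte-range locks with the nobrl mount option. A sketch with a hypothetical share name and credentials — verify the options against your own setup:

$ sudo mount -t cifs //myaccount.file.core.windows.net/myshare /mnt/data \
      -o vers=3.0,username=myaccount,password=<storage-key>,nobrl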

An excellent question! The short answer is:

dvc cache dir --local — to move your cache to a big partition (it takes the new cache directory as an argument);

dvc config cache.type reflink,hardlink,symlink,copy — to enable links and avoid actual copying;

dvc config cache.protected true — it’s highly recommended to make links in your working space read-only to avoid corrupting the cache;

To add your data to the DVC cache for the first time, make a clone of the repository on the big partition and run dvc add there. Then you can run git pull and dvc pull on the small partition, and DVC will create all the necessary links (see the sketch below).
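Putting the answer together, a sketch of the whole flow (all paths and names are hypothetical):

$ git clone <repo-url> /big/repo && cd /big/repo
$ dvc cache dir --local /big/dvc-cache
$ dvc config cache.type reflink,hardlink,symlink,copy
$ dvc config cache.protected true
$ dvc add data/huge-dataset

In the clone on the small partition, point DVC at the same cache (dvc cache dir --local /big/dvc-cache — the --local setting is per-clone and is not committed to Git), then git pull and dvc pull will link the files from the shared cache instead of copying them.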

Usually it means that a parent directory of one of the arguments for dvc add / dvc run is already tracked. For example, you have already added the whole datasets directory, and now you are trying to add a subdirectory, which is already tracked as part of datasets. There is no need to do that: run dvc add datasets or dvc repro datasets.dvc to save the changes.

Check the locale settings you have (the locale command on Linux). Python expects a locale that can handle Unicode printing. Usually it’s solved with these commands: export LC_ALL=en_US.UTF-8 and export LANG=en_US.UTF-8. You can place those exports into .bashrc or another file that defines your environment.

In short — yes, but it can also be configured. By default, DVC uses either your default profile (from ~/.aws/*) or your environment variables. If you need more flexibility (e.g. different credentials for different projects), check out this guide to configure custom AWS profiles, and then you can use them with DVC through these remote options.
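For example, with a named profile in ~/.aws/credentials, a DVC remote can be pointed at it via the profile remote option; a sketch with hypothetical remote, bucket, and profile names:

$ dvc remote add -d myremote s3://mybucket/dvc-storage
$ dvc remote modify myremote profile myproject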

  • How can I output multiple metrics from a single file? Let’s say I have the following in a file: {"AUC_RATIO": {"train": 0.8922748258797667, "valid": 0.8561602726251776, "xval": 0.8843431199314923}}. How can I show both train and valid without xval?

You can use the --xpath option of the dvc metrics show command and provide multiple attribute names to it:

$ dvc metrics show metrics.json --type json --xpath AUC_RATIO[train,valid]
metrics.json:
0.89227482588
0.856160272625
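If you’d rather not pass the flags every time, the type and xpath can also be stored persistently; a sketch, assuming the dvc metrics modify command of DVC at the time of writing:

$ dvc metrics modify metrics.json --type json --xpath AUC_RATIO[train,valid]
$ dvc metrics show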

There are a few options to add such a dependency. The only recommended way so far is to somehow make DVC aware of your package’s version. One way to do that is to create a separate stage that dynamically prints the version of that specific package into a file, which your stage then depends on:

dvc run -o mypkgver 'pip show mypkg > mypkgver'
dvc run -d mypkgver -d ... -o .. mycmd

Yes, you can run dvc commit -f. It will save all the current checksums without re-running your commands.
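A tiny sketch of when this helps — assuming you made a change that you know does not affect the results:

$ dvc commit -f
$ dvc status    # the pipeline should now be reported as up to date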

Yes! These DVC features are called external outputs and external dependencies. You can use one of them or both to track, process, and version your data on cloud storage without downloading it locally.
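A rough sketch of the external outputs setup (bucket, file, and script names are hypothetical): configure a cache on S3, then reference S3 paths directly as outputs or dependencies:

$ dvc remote add s3cache s3://mybucket/cache
$ dvc config cache.s3 s3cache
$ dvc add s3://mybucket/data/raw.csv
$ dvc run -d s3://mybucket/data/raw.csv \
          -o s3://mybucket/data/clean.csv \
          'python process.py'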

If you have any questions, concerns or ideas, let us know here and our stellar team will get back to you in no time!
