July ’19 DVC❤️Heartbeat

Published in Data Version Control · 7 min read · Aug 1, 2019

Every month we share our news, findings, interesting reads, community takeaways, and everything else along the way.

Some of those are related to our brainchild DVC and its journey. The others are a collection of exciting stories and ideas centered around ML best practices and workflow.

Special edition DVC shirt. We made this one for Ruslan — DVC maintainer and the best tech lead.

News and links

As we continue to grow DVC together with our fantastic contributors, we enjoy more and more insights, discussions, and articles either created or brought to us by our community. We feel it is the right time to start sharing more of your news, your stories and your discoveries. New Heartbeat is coming soon!

Speaking of our own news: next month the DVC team is going to the Open Source North America Summit. It is taking place in San Diego on August 21–23. Dmitry and Sveta will be giving talks, and we will run a booth. We are looking forward to it! Stop by for a chat and some cool swag. And if you are in San Diego on those days and want to catch up, please let us know here or on Twitter!

Open Source Summit + ELC North America 2019: Open Source Tools for ML Experiments Man...

Speakers Software Engineer, Iterative AI Ruslan is a Software Engineer at Iterative AI. Previously he worked on live…

ossna19.sched.com

Open Source Summit + ELC North America 2019: Speaker Preparation: Simple Steps with a...

Speakers Head of Developer Relations, DVC.org Svetlana is driving developer relations and community at DVC.org…

ossna19.sched.com

Every month our team is excited to discover new great pieces of content addressing some of the burning ML issues. Here are some of the links that caught our eye in June:

Principled Machine Learning: Practices and Tools for Efficient Collaboration by David Herron

Machine learning projects are often harder than they should be. The code to train an ML model is just software, and we…

dev.to

As we’ve seen in this article, some tools and practices can be borrowed from regular software engineering. However, the needs of machine learning projects dictate tools that better fit the purpose.

First ML-REPA Meetup: Reproducible ML experiments hosted by Raiffeisen DGTL — check out the video and slide decks.

Machine Learning REPA

Announcements of events, projects, tool reviews, and case studies about ML projects, experiment management, automation, and…

ml-repa.ru

ML-REPA is a fantastic new resource for Russian-speaking folks interested in Reproducibility, Experiments and Pipelines Automation. It is curated by Mikhail Rozhkov and highly recommended by our team.

The “How do you manage your machine learning experiments?” discussion on Reddit is full of insights.

Discord gems

There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.

We are sifting through the issues and discussions to share the most interesting takeaways with you.

I have different folders with very different content within one Git repository (basically different projects, or content I want to give different permissions to), and I thought about using different buckets in AWS as remotes. Is it possible with DVC to store some files in one remote and other files in another?

You can definitely add more than one remote (see dvc remote add), and dvc push has a -r option to pick which one to send the cached data files (deps, outs, etc.) to. We would not recommend doing this, though. It complicates the commands you have to run: you will need to remember to specify a remote name for every command that deals with data (push, pull, gc, fetch, status, etc.). Please leave a comment in the relevant issue here if this case is important for you.
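To make that bookkeeping concrete, here is a minimal sketch of a helper that picks the remote from the top-level folder of a DVC-file, so the name does not have to be remembered by hand. The folder names, remote names, and the dvc_command function are all hypothetical; the only DVC behavior it relies on is the -r (--remote) option of push, pull, and fetch.

```python
# Hypothetical mapping from top-level project folder to DVC remote name.
# Each remote would have been created beforehand with `dvc remote add`.
REMOTE_FOR_FOLDER = {
    "project_a": "bucket_a",
    "project_b": "bucket_b",
}


def dvc_command(action, dvc_file):
    """Build a DVC command line that targets the remote assigned to the file's folder."""
    if action not in {"push", "pull", "fetch"}:
        raise ValueError(f"unsupported action: {action}")
    folder = dvc_file.split("/", 1)[0]
    remote = REMOTE_FOR_FOLDER[folder]  # KeyError if the folder has no remote assigned
    return ["dvc", action, "-r", remote, dvc_file]


cmd = dvc_command("push", "project_a/data.dvc")
print(" ".join(cmd))
# In a real DVC repository the list could be passed to subprocess.run(cmd, check=True).
```

Even with a helper like this, every data command still has to go through it, which is the extra complexity the answer above warns about.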

Is it possible with DVC to have multiple (a few) metric files and compare them all at once? For example, we’d like to treat as metrics the loss of a neural network training process (loss as a -M output of a training stage) and, separately, the accuracy of the NN on a test set (another -M output of an eval stage).

Yes, it is totally fine to use -M in different stages. dvc metrics show will just show both metrics.

I have a scenario where an artifacts (data) folder is created by the dvc run command via the -o flag. I have manually added another file to the artifacts folder (or modified it), but when I do dvc push nothing happens. Is there any way around this?

Let’s first do a quick recap on how DVC handles data files (you can definitely find more information on the DVC documentation site).

When you run dvc add, dvc run, or dvc import, DVC puts artifacts (in the case of dvc run, artifacts are the outputs produced by the command) into the .dvc/cache directory (the default cache location). You don’t see this happening because DVC keeps links to these files/directories in the workspace (or, in certain cases, creates copies).
dvc push does not move files from the workspace (that is what you see) to the remote storage; it always moves files/directories that are already in the cache (.dvc/cache by default).
So, now you’ve added a file manually or made some other modifications, but these files are not in the cache yet. The analogy would be git commit: you change the file, you run git commit, and only after that can you push something to the Git server (GitHub, GitLab, etc.). The difference is that DVC does the commit (moves files to the cache) automatically in certain cases: dvc add, dvc run, etc.

There is an explicit command, dvc commit, that you should run if you want to enforce the change to the output produced by dvc run. This command will update the corresponding DVC-files (.dvc extension) and move the data to the cache. After that, you should be able to run dvc push to save your data to the external storage.

Note that when you do an explicit commit like this, you are potentially “breaking” reproducibility, in the sense that there is no longer a guarantee that your directory can be produced by dvc run/dvc repro, since you changed it manually.
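The workspace, cache, and remote relationship can be sketched with a toy model. This is not DVC code: ToyRepo and its methods are made up for illustration, and they only mimic the rule that a push uploads what is already in the cache.

```python
class ToyRepo:
    """Toy stand-in for a DVC project: a workspace, a local cache, and a remote."""

    def __init__(self):
        self.workspace = {}   # path -> file content, what you see on disk
        self.cache = set()    # contents that have been committed to the cache
        self.remote = set()   # contents that have been uploaded

    def commit(self):
        """Plays the role of `dvc commit`: record the current workspace content in the cache."""
        self.cache.update(self.workspace.values())

    def push(self):
        """Plays the role of `dvc push`: upload cached content only, return what was new."""
        newly_uploaded = self.cache - self.remote
        self.remote |= newly_uploaded
        return newly_uploaded


repo = ToyRepo()
repo.workspace["artifacts/a.txt"] = "produced by the stage"
repo.commit()   # in real DVC, `dvc run` does this step automatically
repo.push()

repo.workspace["artifacts/b.txt"] = "added by hand"
first_push = repo.push()    # empty set: the new file is not in the cache yet
repo.commit()               # the explicit commit
second_push = repo.push()   # now the hand-added content is uploaded
print(first_push, second_push)
```

The first push after the manual change uploads nothing, which is the "nothing happens" from the question; only after the explicit commit does the push have something to send.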

I’d like to transform my dataset in-place to avoid copying it, but I can’t use dvc run to do this because it doesn’t allow the same directory as an output and a dependency.

You could do this in one step (one stage), so that getting your data and modifying it happen in a single stage. That way the stage doesn’t depend on the data folder; it depends only on your download-and-modify script.

Can anyone tell me what this error message is about? “To avoid unpredictable behaviour, rerun command with non overlapping outs paths.”

Most likely it means that there is a DVC-file that has the same output twice, or that there are two DVC-files that share the same output file.

I’m getting a “No such file or directory” error when I do dvc run or dvc repro. The command runs fine if I don’t use DVC.

That happens because dvc run tries to ensure that your command is the one creating your output, so it removes existing outputs before executing the command. That way, when you run dvc repro later, it will be able to fully reproduce the output. So you need to make the script create the directory or file itself.
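On the script side, the fix usually amounts to one line. Below is a minimal sketch, assuming a script that writes a model file into an output folder; save_model, the artifacts folder, and the file format are all hypothetical.

```python
import os
import tempfile


def save_model(weights, out_path):
    """Write the weights to out_path, creating the output directory if it is missing."""
    # The directory may have been removed before the command started,
    # so the script creates it instead of assuming it exists.
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    with open(out_path, "w") as f:
        f.write(",".join(str(w) for w in weights))


# Demo in a temporary workspace where the "artifacts" folder does not exist yet.
with tempfile.TemporaryDirectory() as workspace:
    target = os.path.join(workspace, "artifacts", "model.txt")
    save_model([0.1, 0.2], target)
    with open(target) as f:
        written = f.read()

print(written)
```

Passing exist_ok=True keeps the script working both under DVC, where the folder is gone, and when it is run by hand, where the folder may still be there.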

I’m implementing a CI/CD and I would like to simplify my CI/CD or even my training code (keeping them cloud agnostic) by using dvc pull inside my Docker container when initializing a training job. Can DVC be used in this way?

Yes, it’s definitely a valid case for DVC. There are different ways of organizing the storage that training machines use to access data, from the very simple (using a local storage volume and pulling data from the remote storage every time) to using NAS or EFS to store a shared DVC cache.
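For the simplest variant, pulling data every time a container starts, the entrypoint could look roughly like the sketch below. It assumes the image has DVC installed and the container starts inside a checkout of the repository with a configured remote; pull_then_train and the stand-in runner are hypothetical names used only for illustration.

```python
import subprocess


def pull_then_train(train, run=subprocess.run):
    """Fetch DVC-tracked data into the workspace, then start training."""
    # check=True makes subprocess.run raise CalledProcessError if `dvc pull`
    # exits with a non-zero status, so training never starts on missing data.
    run(["dvc", "pull"], check=True)
    return train()


# Dry run with a stand-in runner, so this sketch executes without DVC installed.
recorded = []
outcome = pull_then_train(
    train=lambda: "training started",
    run=lambda cmd, check: recorded.append(cmd),
)
print(recorded, outcome)
```

Because the storage details live in the DVC remote configuration, the training code itself stays cloud agnostic; only the remote changes between providers.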

I was able to follow the getting started examples, however now I am trying to push my data to Github, I keep getting the following error: “ERROR: failed to push data to the cloud — upload is not supported by https remote”.

HTTP remotes do not support upload yet. The example Get Started repository uses HTTP to keep it read-only and to abstract the actual storage provider we are using internally. If you check the remote URL, you should see that it is an S3 bucket, and AWS provides an HTTP endpoint to read data from it.

I’m looking to configure AWS S3 as a storage for DVC. I’ve set up the remotes and initialized dvc in the git repository. I tried testing it by pushing a dataset in the form of an excel file. The command completed without any issues but this is what I’m seeing in S3. DVC seems to have created a subdirectory in the intended directory called “35” where it placed this file with a strange name.

This is not an issue; it is an implementation detail. There’s currently no way to upload the files with the original filename (in this case, the S3 bucket will have the file data.csv but under another name, 20/893143…). The reason behind this decision is that we want to store a file only once, no matter how many dataset versions it’s used in. It’s also a reliable way to uniquely identify the file: you don’t have to worry that someone created a file with the same name (path) but different content.
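The idea of naming a file after its content can be sketched in a few lines. This is a simplified illustration, not DVC’s actual code: it assumes an MD5 digest of the file content, split after the first two hex characters to form the folder and file name, which matches the shape of the path in the example above. The cache_relpath helper and the sample files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_relpath(file_path):
    """Return '<first 2 hex chars>/<remaining 30 chars>' of the file's MD5 digest."""
    digest = hashlib.md5(Path(file_path).read_bytes()).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"


with tempfile.TemporaryDirectory() as tmp:
    v1 = Path(tmp) / "v1" / "data.csv"
    v2 = Path(tmp) / "v2" / "data.csv"         # same name, different content
    copy = Path(tmp) / "copy_of_v1.csv"        # different name, same content
    samples = [(v1, b"a,b\n1,2\n"), (v2, b"a,b\n3,4\n"), (copy, b"a,b\n1,2\n")]
    for path, content in samples:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)

    same_content = cache_relpath(v1) == cache_relpath(copy)
    same_name = cache_relpath(v1) == cache_relpath(v2)

print(same_content, same_name)  # True False
```

Identical content maps to one storage path regardless of the file name, so it is stored once, while two different files that share the name data.csv end up under different paths.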

Is it possible to only have a shared ‘local’ cache and no remote? I’m trying to figure out how to use this in a 40 node cluster which already has very fast NFS storage across all the nodes. Not storing everything twice seems desirable. Esp. for the multi-TB input data

Yes, and it’s actually a very common use case. All you need to do is use the dvc cache dir command to set up an external cache. There are a few caveats, though. Please read this link for an example of the workflow.

If you have any questions, concerns, or ideas, let us know in the comments below or connect with the DVC team here. Our DMs on Twitter are always open, too.