March ’19 DVC❤️Heartbeat

Svetlana Grinchenko
Data Version Control
4 min read · Mar 5, 2019

This is the very first issue of the DVC❤️Heartbeat. Every month we will be sharing our news, findings, interesting reads, community takeaways, and everything along the way.

Some of those are related to our brainchild DVC and its journey. The others are a collection of exciting stories and ideas centered around ML best practices and workflow.

News and links

We read a ton of articles and posts every day, and here are a few that caught our eye. Well-written, offering a different perspective, and definitely worth checking out.

‘What is becoming clear is that, in the late stage of the hype cycle, data science is asymptotically moving closer to engineering, and the skills that data scientists need moving forward are less visualization and statistics-based, and more in line with traditional computer science curricula.’

‘I want to explore how the degrees of freedom in versioning machine learning systems poses a unique challenge. I’ll identify four key axes on which machine learning systems have a notion of version, along with some brief recommendations for how to simplify this a bit.’

‘…the objective of this post is not to philosophize about the dangers and dark sides of AI. In fact, this post aims to work out common challenges in reproducibility for machine learning and shows programming differences to other areas of Computer Science. Secondly, we will see practices and workflows to create a higher grade of reproducibility in machine learning algorithms.’

DVC Discord gems

There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.

We will be sifting through the issues and discussions and sharing the most interesting takeaways.

There is no separate guide for that, but it is quite straightforward. See the DVC-file format description for what a DVC-file looks like inside. All `dvc add` or `dvc run` does is compute the md5 fields in it, that is all. You could write your DVC-file by hand and then run `dvc repro`, which will run the command (if any) and compute all the needed checksums … read more
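As a rough sketch of the idea above (the file name and both hashes here are made up for illustration, not real output), a minimal DVC-file produced by `dvc add` is plain YAML along these lines:

```yaml
# Illustrative DVC-file contents; hashes and path are hypothetical
md5: 3863d0e317dee0a55c4e59d2ec0eef33
outs:
- md5: c157a79031e1c40f85931829bc5fc552
  path: data.csv
  cache: true
```

Writing such a file by hand and then running `dvc repro` on it would fill in or verify the checksums, as described above.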

…There’s a ton of code in that project, and it’s very non-trivial to define the code dependencies for my training stage — there are a lot of imports going on, the training code is distributed across many modules … read more

DVC officially supports only regular Azure Blob Storage. Gen1 Data Lake should be accessible through the same interface, so configuring a regular Azure remote for DVC should work. Gen2 Data Lake, however, seems to have the Blob API disabled. If you know more details about the differences between Gen1 and Gen2, feel free to join our community and share this knowledge.
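For illustration, a regular Azure remote set up with `dvc remote add` ends up in `.dvc/config` roughly like this (the remote name, container, and path below are hypothetical):

```ini
# Hypothetical .dvc/config section; names are made up
['remote "myazure"']
url = azure://my-container/path
```

Since a Gen1 Data Lake exposes the same Blob interface, pointing such a remote at it should work the same way.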

Apache 2.0. It is one of the most common and permissive OSS licenses.

$ dvc remote add upstream s3://my-bucket
$ dvc remote modify upstream region REGION_NAME
$ dvc remote modify upstream endpointurl <url>

Find and click the `S3 API compatible storage` option on this page.

… it adds the data files that are tracked by DVC there, so that you don’t accidentally add them to Git as well. You can open it with a file editor of your liking and see your data files listed there.
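As a small illustration (the file names below are hypothetical), after tracking a couple of files, the `.gitignore` that DVC maintains would contain entries like:

```
/data.csv
/model.pkl
```

Git then skips these paths, while DVC continues to track and version the files itself.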

… with DVC, you can connect your data sources on HDFS to the pipeline in your local project by simply specifying them as external dependencies. For example, let’s say your script `process.cmd` works on an input file on HDFS and then downloads the result to your local workspace. With DVC it could look something like this:

$ dvc run -d hdfs://example.com/home/shared/input -d process.cmd -o output process.cmd
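Under the hood, a sketch of the DVC-file this command would generate (field layout and values are illustrative, with checksums omitted) records the HDFS path as a regular dependency:

```yaml
# Hypothetical DVC-file for the stage above; checksums omitted
cmd: process.cmd
deps:
- path: hdfs://example.com/home/shared/input
- path: process.cmd
outs:
- path: output
```

On `dvc repro`, the HDFS dependency is checked for changes just like a local file would be.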

read more.

If you have any questions, concerns or ideas, let us know here and our stellar team will get back to you in no time.
