Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers

Ward Van Laer
Ixor
3 min read · Jul 3, 2019

Googling for machine learning frameworks to version data, track Python models, and so on, I was surprised to see that such tools are still rare, and that the existing ones are still in their infancy. Currently we use a combination of two open-source tools as part of our machine learning workflow: DVC and MLFlow.

Photo by Perry Grone on Unsplash

Version control for our data using DVC

In an ideal world, like Kaggle competitions, the dataset you use does not evolve over time. However, in the projects we work on, for example in the healthcare industry, the dataset can evolve. DVC is a handy tool to version our dataset using git, while storing the versioned data on external storage like Amazon S3.

Of course, every tool has its downsides. DVC hashes your data to detect changes and pushes it to binary cache files in the cloud. This means there are no human-readable snapshots of the data, which makes it difficult for us humans to compare different versions of the data. DVC doesn’t make data versioning very transparent, but it might well suit your needs.
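To make the hashing behaviour concrete, here is a minimal sketch of content-addressed storage in the style of DVC’s local cache. This is illustrative only: `cache_file` and the directory layout are simplifications, not DVC’s actual API.

```python
import hashlib
import os
import shutil

def cache_file(path, cache_dir):
    """Store a file under a name derived from its MD5 hash,
    similar in spirit to DVC's local cache layout (simplified)."""
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    # Content addressing: the hash, not the filename, decides the location.
    dest = os.path.join(cache_dir, md5[:2], md5[2:])
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy(path, dest)
    return md5
```

Because only the hash ends up under version control, identical files map to a single cache entry, but there is no diff-friendly view of what actually changed between two versions of the data.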

Result tracking in MLFlow

MLFlow is still in its alpha phase. It offers three components; we use the tracking component. With a single command, MLflow starts a web server to which all kinds of tracking information can be sent: it’s possible to track model parameters, metrics and artifacts (e.g. images of result curves, plotted diagrams or model weights). These results can be inspected in the UI, where every run is also linked to the corresponding git commit.

Tips and Tricks:

  • Make MLFlow logging optional by building a simple logging switch into your code. This avoids cluttering your MLFlow project with a load of empty or incomplete runs while you are debugging.
  • Every MLFlow run captures the git hash to keep the code version tracked. However, you should (automatically) commit all code updates before tracking an experiment to ensure consistency.
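The logging switch from the first tip can be a thin wrapper around the tracking calls. A minimal sketch: the `ExperimentLogger` class and the `MLFLOW_LOGGING` environment variable are illustrative names, and the real MLflow calls are shown as comments so the snippet runs without a tracking server.

```python
import os

class ExperimentLogger:
    """Forwards tracking calls only when logging is enabled."""

    def __init__(self, enabled=None):
        # Default: read the switch from an environment variable.
        if enabled is None:
            enabled = os.getenv("MLFLOW_LOGGING", "0") == "1"
        self.enabled = enabled
        self.records = {}  # stand-in for the MLflow backend in this sketch

    def log_param(self, key, value):
        if self.enabled:
            # real code would call: mlflow.log_param(key, value)
            self.records[key] = value

    def log_metric(self, key, value):
        if self.enabled:
            # real code would call: mlflow.log_metric(key, value)
            self.records[key] = value

debug_logger = ExperimentLogger(enabled=False)
debug_logger.log_param("learning_rate", 0.01)  # dropped while debugging

run_logger = ExperimentLogger(enabled=True)
run_logger.log_metric("val_accuracy", 0.93)    # recorded for a real run
```

With the switch off, nothing reaches the tracking server, so debugging sessions never pollute the experiment history.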

GPU accelerated training in Docker containers

Docker is a great tool designed to create, run and deploy applications by using containers. All necessary parts and libraries are installed inside the container, which makes your application easily portable to other machines (think of it as a small virtual machine). It gets even more interesting if you use Nvidia-Docker to run your containers. Nvidia-Docker is especially created for deep learning applications: it provides driver-agnostic CUDA images that spare you the headache of incompatible GPU drivers.
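As an illustration, a GPU training container might start from one of NVIDIA’s CUDA base images. The tag and file names below are examples only; pick versions that match your framework and drivers.

```dockerfile
# Example only: the base image tag and package setup are illustrative.
FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04

RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /workspace
COPY requirements.txt .
RUN pip3 install -r requirements.txt

COPY . .
CMD ["python3", "train.py"]
```

The container is then started with the NVIDIA runtime, e.g. `docker run --runtime=nvidia ...` (or `docker run --gpus all ...` on newer Docker versions), which exposes the host GPUs inside the container.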

If you don’t want to build your Docker container from scratch, there are some very interesting images available on Docker Hub: the ufoym/deepo images have CUDA and cuDNN preinstalled, together with a selection of the most used machine learning libraries.

As you can see, it is possible to manage your workflow using open-source and free tools. However, using one of the emerging data science platforms could give you the advantage of having everything in one place, plus user management when running projects.

At IxorThink we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company we can provide stable products from proof-of-concept to deployment. Feel free to contact us for more information.

Machine Learning Engineer at Ixor | Magician