Versioning Artificial Intelligence Models with MLflow

Wilder Rodrigues
Applied Artificial Intelligence
7 min read · Jan 24, 2019

--

Source: mlflow.org

You might be thinking: MLflow is still in beta, and this is yet another Medium story about it. Well, let me start by saying that you are wrong. Why? Because it is easy to write a story about how to run something, anything, on your local machine. However, when things have to work in a production environment, the "works on my machine" jargon won't help.

In this story, I will show you how to run a solution with Jupyter notebooks, a remote MLflow tracking server, and two storage options: SFTP and MinIO.

Before we get into the details: you might be feeling like, "I don't want to go through this whole story just to get stuff done!"

If that is you, please help yourself and go straight here: Artificial Intelligence Engineering. But clap for it. ;)

Are you still here? I hope so!

Before we continue, let me explain my view on the current cluster of job titles for people working on anything close to Artificial Intelligence. Nowadays, we have the following positions in the market:

  • Data Scientists
  • Machine Learning Engineers
  • Deep Learning Engineers
  • NLP Engineers / Researchers
  • Computer Vision Engineers / Researchers
  • Software Engineers in AI (believe me, I saw that!)

I'm sure I could fit some others in there. But what is the point? Well, I think we should first try to get things done the right way. For instance, keep roles separated as they used to be in the past. What I mean is: there are engineers, making things work; and researchers, creating things that someone else will make work. And sometimes we do find people capable of doing both. So, that is where I want to start:

Artificial Intelligence Engineer

What is the difference compared with all the positions listed above? Well, I call myself an Artificial Intelligence Engineer. And why is that? I consider myself capable of developing traditional Machine Learning models, [Deep] Neural Networks, [Deep] Convolutional Neural Networks, LSTMs, GRUs, Attention models, Fusion models, Generative Adversarial Networks, and so on; I even created an activation function (SineReLU).

But it doesn't stop there. I have been engineering software for the last 25 years. So I also know a thing or two about programming languages, methodologies, databases, cloud computing, and so forth.

"Enough! Stop talking about you!", one would say and I would actually agree. To make Artificial Intelligence evolve, the market doesn't need only Data Scientist whom capable of cleaning data and thinking about feature extraction, it needs people capable of building the models, the software around it, the infrastructure and serve it in an end-to-end fashion. I don't mean everybody has to be able to do all of it, but every team needs at least one person that could and that person would be called the Artificial Intelligence Engineer.

If you are still reading, thanks! Now, let's get to something more interesting.

How hard could it be?

When we look at how Software Engineering evolved, and at the whole ecosystem that supports it, we have to ask: do we have the same for AI-powered systems?

Let's try a thought experiment here. Imagine you are going to start a proof of concept with Neural Networks and Computer Vision. The first things to do, in sequence, would probably be these:

  1. Create a new Jupyter notebook;
  2. Copy a model from GitHub;
  3. Pre-process your data to be able to feed the CNN model;
  4. Train locally to see how it does;
  5. Forget about the GPUs ($$$) and just deploy it.

Or, if you are a little bit more into it, you would do the following:

  1. Create a new Jupyter notebook;
  2. Get some pre-trained weights (e.g. from models trained on CIFAR-10/100, AlexNet weights, etc.);
  3. Copy a model from GitHub;
  4. Train locally to see how it does;
  5. Forget about the GPUs ($$$) and just deploy it.

Well, if we were to do yet another round, GPUs would be added, and that would be pretty much it.

In this whole process, few people would stop to think about how to version the model, how to get the experiments done right and be able to replay them with the same set of hyper-parameters, how to store the weights in a way that links them to those hyper-parameters, and so on. It's a hard job to think about all of those things and to actually get them working properly.

If you haven't given up by now, that's cool! If you doubt that I will actually get to the point: well, just check the link to the GitHub repository I shared all the way up there. But for now, please stay.

MLflow to the Rescue

Let me introduce you to MLflow. The first time I heard about it, about four months before this writing, I thought it was a cool idea. The only thing I did not like was that it was introduced as a Databricks thingy. Okay, they are pretty much behind it, but it is still an open-source project. It means that I could take my time to play around and build something with it without being locked into whatever Databricks would offer me.

And the time I took.

What does MLflow do? It offers the possibility to track models by logging metrics, parameters, and artefacts for each run. That is already something interesting, because one can know for sure what is running in production and also keep track of the evolution of the model.
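To make that concrete, here is a minimal tracking sketch. The experiment name, the hyper-parameter values, and the fake training loop are illustrative placeholders, not code from the repository:

```python
import random

import mlflow


def train_one_epoch(epoch):
    # Stand-in for a real training step; returns a fake validation loss.
    return 1.0 / (epoch + 1) + random.random() * 0.01


# Group runs under a named experiment (created if it does not exist yet).
mlflow.set_experiment("cnn-experiments")

with mlflow.start_run():
    # Hyper-parameters are logged once per run...
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)

    # ...while metrics can be logged repeatedly, building up a history.
    for epoch in range(10):
        mlflow.log_metric("val_loss", train_one_epoch(epoch))

    # Artifacts (weights, plots, etc.) are stored next to the run's
    # params and metrics; here we log a placeholder weights file.
    with open("weights.h5", "wb") as f:
        f.write(b"placeholder")
    mlflow.log_artifact("weights.h5")
```

Because the weights are logged in the same run as the hyper-parameters and metrics, they stay linked, which is exactly the problem described above.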

Another interesting aspect of MLflow is serving: it can serve models built with different ML and DL frameworks. The idea is that, with a one-liner, you can start a server that you can post data to and get predictions from.
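Once a model is being served (say, locally on port 5000), querying it is a plain HTTP call. The host, port, and column names below are assumptions, and the exact payload schema varies across MLflow versions, so treat this as a sketch:

```python
import requests

# Two input rows for a model expecting two features (pandas-split orient).
payload = {
    "columns": ["x1", "x2"],
    "data": [[0.5, 1.2], [3.4, 0.7]],
}

response = requests.post(
    "http://localhost:5000/invocations",
    json=payload,
    headers={"Content-Type": "application/json; format=pandas-split"},
)
print(response.json())  # one prediction per input row
```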

But what else? Well, it also has a deployment feature, compatible with Azure, Google Cloud, and AWS SageMaker. That is quite interesting for a beta version. If you want to hop on and learn more, there are some basic stories describing the features in a bit more detail: ML Lifecycle.

Do you need more details about MLflow and its perks? Probably not. I think that if you look at what I have done with it, and at the possibilities it opens up, it will bring some nice ideas to that head of yours.

What is a proper way to do it?

As I mentioned before, it's pretty easy to get MLflow running on your machine, I mean, the local machine. You pip install it and you are almost done. But what if you need it in a more, let's say, production-ready setup? Well, nothing is closer to production than having an MLflow remote tracking server with either SFTP or S3 storage configured. And to be honest, that is where the existing stories lack a lot of information and code.
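In code, pointing a client at such a setup boils down to a tracking URI plus storage credentials. A sketch, assuming typical docker-compose hostnames, ports, and MinIO keys (all of them placeholders, not values from the repository):

```python
import os

import mlflow

# Runs, params, and metrics go to the remote tracking server.
mlflow.set_tracking_uri("http://localhost:5000")

# MinIO speaks the S3 protocol, so the S3 artifact store only needs an
# endpoint override plus the keys the MinIO container was started with.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"
os.environ["AWS_ACCESS_ID"] = "minio-access-key"  # hypothetical key
os.environ["AWS_ACCESS_KEY_ID"] = "minio-access-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minio-secret-key"
mlflow.create_experiment("cnn-minio", artifact_location="s3://ai-models/cnn")

# The SFTP variant only needs an sftp:// artifact location (and the
# pysftp package installed on the client).
mlflow.create_experiment(
    "cnn-sftp", artifact_location="sftp://mlflow@sftp-server/data/artifacts"
)
```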

Before we dig into the code, let's have a look at the architecture:

[Architecture diagram ("Awesome ASCII art"): one Docker container per service, covering JupyterLab, the MLflow tracking server, SFTP storage, and MinIO storage]

The whole idea, to mimic a real production environment, is to have the user consuming the services remotely. Hence, I built the environment using Docker containers, with one container per service.

Although one might say that training models from within a Docker container on a MacBook is no better than training on the MacBook itself, it was done this way to ease the communication between the services.

The idea behind the diagram above is to have the user run a few command lines, after which a whole environment is ready to be used, or tested. After cloning the repository, what one has to do is the following:

  • Open a terminal window
  • Type docker-compose up
  • Open a terminal tab
  • Type ./scripts/copy_known_hosts.sh
  • Type ./scripts/create_experiments.sh
  • Go to http://localhost:9000 and create a bucket called ai-models (or create it from code, as sketched below)
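If you prefer to skip the web UI, the bucket can also be created programmatically. A sketch using boto3, assuming MinIO's default local endpoint and placeholder keys that must match the ones the container was started with:

```python
import boto3

# MinIO exposes an S3-compatible API, so boto3 works against it as long
# as the endpoint is overridden to point at the local container.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # MinIO, not AWS
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)
s3.create_bucket(Bucket="ai-models")
print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])
```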

After that, just go to JupyterLab (http://localhost:8888) and play around with the one existing notebook there.

This gives you a whole environment, from remote tracking server to storage (with two different protocols), not to mention the JupyterLab playground.

How to get to production from here?

That's an easy one. If you already have S3- or SFTP-compatible storage, just configure the MLflow image to use that storage instead of the one from the example, typically by pointing the tracking server's default artifact root at your bucket or SFTP path and giving it the matching credentials.

If you don't have storage, then get some! You cannot use your local disk as storage for a remote setup. However, if you cannot afford the storage bill, just use the setup I created and make sure you back up the storage directory to an external disk so you can restore it if you need to.

All in all, do not get locked in, because it doesn't take much to get a proper system running. In a team of five, at least one person should know how these things work.

Are you still reading?

Thanks a lot for your time. I hope you have enjoyed it and that it can save you some time and help you to get your models properly versioned. Now, to the code and some more reading, please follow this link: Artificial Intelligence Engineering.

I have some other stories on DL + NLP + AWS that you might want to look at. So, stop reading now, clap and go to the next story. ;)
