Continuous deployment in Machine Learning systems

Juan López López
Jul 26, 2019 · 14 min read


*This post is a summary of the presentation I gave at XConf 2019.

Machine learning has evolved a lot in recent years, but I think this is a good moment to stop and think about the best way to do continuous deployment in this area, and about the best practices for doing it.

A little bit of context

To understand the topic we need to know what CD is and what machine learning is.

Continuous deployment (CD)

It is a software strategy in which every new change committed to our code is automatically deployed to production. Everybody wants it because it gives us new features on the fly, better code quality, faster development and, above all, more experimentation. In summary, more innovation: if we have a new feature today, why would we wait to deploy it?

So… we want to reduce the gap between a new idea and when this idea is in production.

Machine learning

First of all, machine learning is not just hype. It is very useful and we can see it everywhere. Basically, machine learning is a subset of Artificial Intelligence in which we use past data to predict future data. In a nutshell, ML is a set of statistical models that systems use to perform a specific task effectively; the main difference from traditional algorithms is that it does not use explicit instructions, relying on patterns and inference instead.

So, why do we want to achieve CD in machine learning? I think for the same reason as above.

So… we want to reduce the gap between a new idea (in machine learning) and when this idea is in production (in a machine learning system).

OK, we understand why we want continuous deployment in machine learning but…

How can we achieve it?

Two years ago, I was working at a big data company and started a side project (using ML, obviously), and I remember thinking that achieving continuous deployment was impossible, because I could not figure out which kinds of tests I needed.

Then, in 2017, some people from Google published this paper. I remember the day I read it; although it is not really a paper about continuous deployment, it is a paper about tests in machine learning.

And on the second slide of the paper, there were two images:

First of all, we can see that the second one is much more complex, mainly because ML systems have two new parts compared with traditional systems: the data part and the model part, plus all the kinds of tests we need around these two parts.

This paper has inspired me to think about this topic and to share with you all the conclusions and ideas I have so far for achieving CD. Let’s see them.

Machine learning systems

This picture is only an overview of the different parts of a machine learning system. This simple schema will serve as a structure to explain the different pipelines.

Code

Code is always code. This is the key. Whenever you are coding, you should always apply the best practices described in XP (TDD, pair programming, etc.), because in machine learning systems we do not only have models, data and other fancy algorithms; we are working on very complex systems with a lot of moving parts, so we are also writing code for services, APIs and so on…

Test pyramid. ThoughtWorks Insights

Also, it is very interesting to think about putting quality gates in our pipeline, because we do not want to put unhealthy code into production.

In CD it is very important to understand that every commit is going into production, so we need to ensure that all our code is “production-ready”; you will probably need to use feature toggles (aka feature flags).
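
As a minimal sketch of a feature toggle, assuming a hypothetical recommendation service and toggles read from environment variables (a stand-in for a real configuration service):

```python
import os

def is_enabled(feature: str) -> bool:
    """Read a toggle from the environment (a stand-in for a real config service)."""
    return os.getenv(f"FEATURE_{feature.upper()}", "off") == "on"

def current_recommendations(user_id: str) -> list:
    return ["stable-item-1", "stable-item-2"]      # proven code path

def new_recommendations(user_id: str) -> list:
    return ["experimental-item-1"]                 # merged, but still dark

def get_recommendations(user_id: str) -> list:
    # The new code path ships with every commit but only runs when toggled on.
    if is_enabled("new_recommender"):
        return new_recommendations(user_id)
    return current_recommendations(user_id)

print(get_recommendations("user-42"))              # stable path unless the flag is on
```

This way every commit can go to production while the unfinished feature stays dark until you flip the flag.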

Code pipeline

This picture is only an example of a code pipeline. Obviously, depending on our project we will need more or fewer steps. We can see three different parts here: continuous integration, continuous delivery, and continuous deployment. Usually we want to achieve the last one, but we cannot always do it.

Before starting the data part, I want to share a quote from the paper I mentioned before:

Unlike in traditional software systems, the “behavior of ML systems is not specified directly in code but is learned from data”.

This is so important because in traditional systems we can ensure the quality of our code with more code alone, but now our tests also depend on the data sets used for training models.

Data

Before starting to talk about data pipelines, I just want to mention that nowadays companies are changing their mindset and trying to move to a model closer to DDD. Because of this, we are moving from a monolithic data lake to a distributed data mesh, and you will probably have more than one data pipeline in your system.

Data pipeline

This is only an example of a data pipeline, which I am going to explain step by step.

Ingestion

We live in a data world, so we need to take data from different sources and feed our system with it. The place where we put all this data, in raw form, is called a data lake.

And yes, we live in a data world with a lot of sources, and more arriving all the time, so we need to know and understand those sources. The best way to do so is a data catalog.

Another important thing is the governance of your data. You need a schema for each source you have. A plain example: if you have a source that produces people’s ages, your schema should state that this source produces an integer between 1 and… 150? 👵🏻

To finish with ingestion, it is very important to watch for silent failures. If a source stops producing new data, you have to notice it; maybe your models depend heavily on that data.
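
A minimal sketch of both ideas, assuming a hypothetical “ages” source: a declared schema that each record is validated against, and a freshness check that flags a source that has gone silent.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema for the "ages" source from the example above.
AGE_SCHEMA = {"type": int, "min": 1, "max": 150}

def validate_record(value, schema) -> bool:
    """Reject records that do not match the declared schema."""
    return isinstance(value, schema["type"]) and schema["min"] <= value <= schema["max"]

def source_is_silent(last_event_at: datetime, max_delay: timedelta) -> bool:
    """Flag a source that has stopped producing new data (a silent failure)."""
    return datetime.now(timezone.utc) - last_event_at > max_delay

print(validate_record(37, AGE_SCHEMA))     # True
print(validate_record(480, AGE_SCHEMA))    # False: outside the declared range
print(source_is_silent(datetime.now(timezone.utc) - timedelta(hours=6),
                       max_delay=timedelta(hours=1)))   # True: raise an alert
```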

Data wrangling (aka data munging)

OK, we have our data in a data lake in a raw state, but we want it in a production-ready state. The place where we are going to put that data is a data mart. We do not want a data warehouse because, as we discussed before, we are in a DDD mindset, so we will have many data marts instead of the classic architecture.

But wait… be careful, please. If we are transforming our data, we are probably coding, so use all the XP practices you know here too. Remember, code is always code, and if your data is bad after this data cooking, everything will be bad.

Furthermore, in this step you may want to do some data cleaning. Be careful here too, and know your data better than anyone, because you are the one deciding what this cleaning does.
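
Because wrangling code is still code, it deserves the same tests. A minimal sketch, assuming a hypothetical age-cleaning rule based on the schema from the ingestion example:

```python
def clean_ages(raw_ages):
    """Drop values outside the declared schema instead of letting them poison training."""
    return [a for a in raw_ages if isinstance(a, int) and 1 <= a <= 150]

def test_clean_ages_drops_out_of_range_values():
    assert clean_ages([25, -3, 42, 480, "n/a"]) == [25, 42]

test_clean_ages_drops_out_of_range_values()   # run by the pipeline before promotion
```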

Get training data

  • We live in a big data world, so we have tons of data. In practice, we do not have the resources to work with that amount of data on our own computers, nor in the training infrastructure. Because of this, and because we ❤️ data scientists and want to make their life easier, we need to cut the data into smaller slices. You may be thinking that if we do not train our model with all the data we will probably see errors and strange behavior in production, but we will talk about how to handle that later.
  • Yes, we have big data and we need to cut it, but we cannot do this in just any way; we have to do it in a statistically proper way (e.g. importance-weighted sampling; see the sketch after this list). It may be an overly simple example, but if you want to summarize a one-hundred-chapter book you cannot just take chapters one to five, because there will probably be important information in other parts of the book.
  • We care about our clients, so we have to take care of their data, which means a lot of data security. It is important to anonymize all sensitive data if we want to work with it and, of course, to use all the security protocols in our system: SSL, encryption…
  • Shit happens… sorry, but I have to say it. And when it happens, you will want a snapshot of the exact data with which the strange behavior or error occurred. So, the best strategy is data versioning: save snapshots over time so you can reproduce errors in your testing environments.
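
To make the sampling point concrete, here is a minimal sketch of importance-weighted sampling using only the standard library; the records, weights and sample size are hypothetical, and a real pipeline would derive the weights from the statistics of your own data.

```python
import random

def weighted_sample(records, weights, k, seed=42):
    """Draw k records with probability proportional to their importance weights
    (with replacement), so rare but important slices stay represented."""
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=k)

# Hypothetical example: keep the rare "fraud" rows visible in the training subset
# instead of naively taking the first N rows of the dataset.
records = [{"id": i, "label": "fraud" if i % 50 == 0 else "ok"} for i in range(10_000)]
weights = [10.0 if r["label"] == "fraud" else 1.0 for r in records]
subset = weighted_sample(records, weights, k=1_000)
print(sum(r["label"] == "fraud" for r in subset), "fraud rows in the sample")
```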

To finish the data part, I just want to recall the problem of training/serving skew. On the data side this matters because if you train your models on stale data, they are going to fail when they arrive in production. Remember that time is very important: we live in a fast world and data changes very fast.

“All models are wrong.” Common aphorism in statistics, maybe from centuries ago.

“All models are wrong, some are useful.” George Box (1978).

“All models are wrong, some are useful for a short period of time.” TensorFlow’s team (13th June 2019).

Model

Before starting to talk about the model pipeline, we need to understand some important points in the development of new ML models.

First of all

  • Design & evaluate the reward function. This is not a task only for data scientists or the engineering team; it is a task that involves the whole organization. If we do not know what the objective of our model is, it will never run properly.
  • Define errors & failures. It may be easy to understand what an error is, but what is a failure? You need to define it before starting your development. Is a prediction of 1€ a failure? Maybe it is if you are predicting the price of a house in Spain, but what if you are predicting the price of a pen? (See the sketch after this list.)
  • Ensure mechanisms for user feedback. As I said before, shit happens, and those moments are the best time to give the user a place to leave feedback. And not only when an error occurs: maybe your model does not have enough data, and that is also a perfect moment for feedback.
  • Try to tie model changes to a clear metric of the subjective user experience. Do not forget what your goal is; everything will be easier and better if you link this metric to the user experience. Ultimately, we want to improve the user experience.
  • Objective vs. many metrics. Yes, I understand we need a lot of metrics to understand whether our model is running properly, but it is worth repeating that you need a single objective to decide whether a model is good, bad or better than the previous one. And this objective must be measurable.
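
As a minimal sketch of what “define failures” and “one measurable objective” can look like in code; the relative-error thresholds below are hypothetical and would come out of the whole-organization discussion mentioned above.

```python
# Hypothetical maximum tolerated relative error per product type.
FAILURE_THRESHOLDS = {"house": 0.10, "pen": 0.50}

def is_failure(product: str, predicted: float, actual: float) -> bool:
    """A prediction is a failure when its relative error exceeds the agreed threshold."""
    relative_error = abs(predicted - actual) / actual
    return relative_error > FAILURE_THRESHOLDS[product]

def objective(predictions) -> float:
    """Single measurable objective: the share of predictions that are not failures."""
    results = [not is_failure(p, pred, actual) for p, pred, actual in predictions]
    return sum(results) / len(results)

print(is_failure("house", predicted=1.0, actual=250_000.0))   # True: clearly a failure
print(is_failure("pen", predicted=1.0, actual=1.2))           # False: close enough
print(objective([("house", 240_000.0, 250_000.0), ("pen", 1.0, 1.2)]))  # 1.0
```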

Model pipeline

Now it is time for the model pipeline. I am going to explain it step by step, but before starting it is important to understand something: today your data scientists are probably using one framework, but next month they will want to try another, so your pipeline should stay the same no matter which framework you use.

Code a new model candidate

Coding again… I am being very repetitive, but remember to always apply the best practices when you are coding, here too. Code is always code. In this step we basically code new changes to our model; the idea is to commit the Python file (with its tests) directly to our version control system, and the first step in this pipeline is running the tests and stopping the pipeline if any of them fail.

After that, we can package our code and we will have a new version of our model in our model repository. For this purpose we can use a tool like DVC.
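
As a minimal sketch of the kind of test that gates this step (the toy model and the assertions are hypothetical; a real suite would also exercise training behaviour): if it fails, the pipeline stops and nothing gets packaged.

```python
import numpy as np

def build_model():
    """Toy model candidate: predict price as a fixed multiple of surface area."""
    return lambda surface_m2: 3_000.0 * surface_m2

def test_model_predictions_are_positive_and_finite():
    model = build_model()
    predictions = np.array([model(s) for s in (30, 75, 120)])
    assert np.all(predictions > 0)
    assert np.all(np.isfinite(predictions))

test_model_predictions_are_positive_and_finite()   # the pipeline stops if this raises
```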

Training model

This is perhaps the most complex step. Here we have three different parts:

  • Feature engineering. This is where our system decides which features the model will use in its training. It is also where we apply techniques to fix our data, such as handling unbalanced data, unknown unknowns, and so on. Never forget to be critical with your features: data dependencies cost more than code dependencies. In summary, it is a trade-off between using all the features and the cost in infrastructure in terms of time, resources and money.
  • Training. This is where your model is trained. Here I just want to say that we have to be careful again with training/serving skew: if you train your model today but put it into production a long time later, it will probably fail in production, so try to reduce this gap. I understand that it is difficult, but if you can make training deterministic it dramatically simplifies your system.
  • Autotuning of hyperparameters. There are several algorithms for doing this, but most of them run the training several times to determine the best hyperparameters, as in the sketch after this list.
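
A minimal sketch of that “run the training several times” idea as a simple grid search; the search space and the toy scoring function are hypothetical stand-ins for a real training run.

```python
import itertools

def train_and_evaluate(learning_rate: float, depth: int) -> float:
    """Hypothetical training run returning a validation score (higher is better).
    In a real pipeline this would train the committed model candidate."""
    return 1.0 - abs(learning_rate - 0.1) - 0.01 * abs(depth - 6)

search_space = {
    "learning_rate": [0.01, 0.1, 0.3],
    "depth": [3, 6, 9],
}

# Grid search: the simplest form of hyperparameter autotuning.
best_score, best_params = float("-inf"), None
for values in itertools.product(*search_space.values()):
    params = dict(zip(search_space.keys(), values))
    score = train_and_evaluate(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))   # {'learning_rate': 0.1, 'depth': 6} 1.0
```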

Model competition

I call this step model competition because here we are going to figure out which model is the best.

The idea is to send production data, as shadow traffic, to the new model, to the model currently in production, and to the other models we have in our model repository. Notice that only the production model returns data to the production system.

How are we going to compare them? This is definitely the most difficult question to answer here, and it depends on the kind of problem. If you have a simple classification model you can compare the prediction with reality. But what can we do in, for instance, a recommendation model? Perhaps you could take 80% of the products in each order and try to predict the remaining 20% the client has ordered. This is only an example and an idea; the key here is understanding, as I said before, that we need the reward function of our model before testing it. But we will talk about this in the near future.
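
A minimal sketch of the shadow-traffic idea, with a hypothetical champion and challenger: every request is answered by the production model, while challenger predictions are only logged for later scoring.

```python
from collections import defaultdict

shadow_log = defaultdict(list)

def production_model(x):
    return 2.0 * x                                   # hypothetical champion

CHALLENGERS = {"candidate-v2": lambda x: 2.1 * x}    # pulled from the model repository

def handle_request(x: float) -> float:
    for name, model in CHALLENGERS.items():
        shadow_log[name].append((x, model(x)))       # logged, never returned
    return production_model(x)                       # only this reaches the client

print(handle_request(3.0))    # 6.0, from the production model
print(dict(shadow_log))       # challenger predictions kept for later scoring
```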

Model performance

OK, so we are going to test the performance of the models with production data and compare them in terms of the reward function and failures. A lot of teams use the ROC curve, for instance.

You need to be careful: your model may seem to be running properly, with 90% accuracy overall for instance, but what happens if on one data slice its accuracy is 30%? At the very least we must take this into account. Be aware of it and decide what you want to do in these cases.

My advice? Have an accuracy baseline, and if your model does not pass it on any data slice, the model is not good enough.
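
A minimal sketch of that per-slice baseline check; the slices, labels and 0.8 baseline are hypothetical.

```python
def slice_accuracies(examples):
    """examples: iterable of (slice_name, predicted_label, true_label)."""
    totals, hits = {}, {}
    for slice_name, predicted, actual in examples:
        totals[slice_name] = totals.get(slice_name, 0) + 1
        hits[slice_name] = hits.get(slice_name, 0) + (predicted == actual)
    return {s: hits[s] / totals[s] for s in totals}

def passes_baseline(examples, baseline: float = 0.8) -> bool:
    """The model is only good enough if every slice clears the baseline."""
    return all(acc >= baseline for acc in slice_accuracies(examples).values())

examples = [("es", 1, 1), ("es", 0, 0), ("fr", 1, 0), ("fr", 0, 1)]
print(slice_accuracies(examples))   # {'es': 1.0, 'fr': 0.0}
print(passes_baseline(examples))    # False: the 'fr' slice fails the baseline
```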

Also, as you are receiving production data, use the feedback loop to improve the model.

Model champion

And with this comparison of performance, we have the model champion: the best one. This champion could even be an old model, because data changes, and models that performed poorly in the past could be good models now.

Deploy the champion model

We cannot just put this super cool and fancy model into production and walk away.

We need to apply a shadow traffic strategy because we trained the model with past data and tested it with production data, but we never know how it is going to perform in the future. With shadow traffic, we can analyze the logs without affecting our clients.

There are other techniques we need to know and use (like canary releases and A/B tests) but, above all, we need to define a protocol for rollbacks, because everything is a candidate to fail.
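
A minimal sketch of a canary release with an automatic rollback trigger; the traffic share and error threshold are hypothetical, and a real rollout would rely on your serving infrastructure rather than an in-process router.

```python
import random

class CanaryRouter:
    def __init__(self, stable, candidate, canary_share=0.05, max_error_rate=0.02):
        self.stable, self.candidate = stable, candidate
        self.canary_share, self.max_error_rate = canary_share, max_error_rate
        self.candidate_calls, self.candidate_errors = 0, 0
        self.rolled_back = False

    def predict(self, x):
        use_candidate = not self.rolled_back and random.random() < self.canary_share
        if not use_candidate:
            return self.stable(x)
        self.candidate_calls += 1
        try:
            return self.candidate(x)
        except Exception:
            self.candidate_errors += 1
            if self.candidate_errors / self.candidate_calls > self.max_error_rate:
                self.rolled_back = True          # rollback: all traffic to stable
            return self.stable(x)                # the client still gets an answer

router = CanaryRouter(stable=lambda x: 2.0 * x, candidate=lambda x: 2.1 * x)
print(router.predict(3.0))
```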

Monitoring

…because shit happens

In continuous deployment we have short releases and fast development, but knowing that our systems are running properly, and being the first to realize when something is going wrong, is just as important.

Obviously, we need to monitor our model. Here are some important things to watch:

  • Create a dashboard with clear and useful information. This is the main point. We are building complex systems, so we need a dashboard with only the most important information, making it easy to see whether our systems are running properly.
  • Schema changes. We have talked about this before: if the sources change the data they produce, we need an alert so we notice it.
  • Infra monitoring. Training speed, serving latency, RAM usage, etc.
  • User feedback. Sometimes we forget to monitor whether something goes wrong while the system feeds user feedback back into the models.
  • Stale models. Data changes very fast in this fast world, and if it has changed a lot we need to notice. There are statistical ways to detect this; basically, we can monitor whether the data distribution has changed in a statistically significant way (see the sketch after this list).
  • Errors. If our model is failing, obviously we need to be aware of it, but we also have other services, APIs, and other components we need to monitor.
  • Data pipeline. Sometimes we have errors in our data pipeline and only realize it much later. We need logs at every step of the pipeline.
  • Silent failures. If a source stops producing new data, we need to notice this early, because our models may depend heavily on that data.
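
A minimal sketch of that kind of stale-data check, using a two-sample Kolmogorov–Smirnov test from SciPy; the significance level and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

def distribution_has_drifted(training_sample, production_sample, alpha=0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test: a small p-value means the production
    distribution differs from the training one in a statistically significant way."""
    result = ks_2samp(training_sample, production_sample)
    return result.pvalue < alpha

rng = np.random.default_rng(0)
training = rng.normal(loc=0.0, scale=1.0, size=5_000)
production = rng.normal(loc=0.6, scale=1.0, size=5_000)    # the world has moved
print(distribution_has_drifted(training, production))       # True: alert and retrain
```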

Conclusions

This brings me to the end of the post. I just want to leave you with some conclusions to summarize it:

  • Code is always code. Use XP practices.
  • Objective-driven modeling. Always have a clear objective before starting.
  • Know your data. Stale data, schema, governance, data catalog…
  • Clear metrics for complex systems.

