THE HIDDEN COSTS OF BRINGING MODELS TO PRODUCTION

TABLE OF CONTENTS

8 min read · Dec 2, 2021


  • Modeling is not the final horizon, brace yourselves for production
  • When changing anything changes everything (Experiment Tracking)
  • You better get prepared for data changes (Monitoring & Alerting)
  • Models get old — keep an eye on them (Feedback Loop & A/B Testing)
  • Data preparation and “the kitchen sink” (Pipeline & Scheduler)
  • What makes a great machine learning platform for production

Modeling is not the final horizon, brace yourselves for production

As data-science practitioners we eat machine learning for breakfast. On a day-to-day basis, we’re dealing with dozens of packages and libraries to get the most out of the data we have. But do we really believe that consistently improving a model’s accuracy is our final horizon? Of course not! Getting into production and staying in production should be our mission.

But argh, it's hard if you’re not well prepared. Besides all of the particular ways a machine learning project can fail because of crappy data or badly defined objectives (a subject that could fill an entire encyclopedia on its own), in this white paper we’re revealing the hidden costs: all of those unseen efforts you have to deal with along the machine learning journey to get your model into production. Whether you’re an OG (old guy) data scientist like myself, just getting started, or somewhere in between, we can all benefit from looking more closely at the icebergs that can bring a project to a screeching halt.

When changing anything changes everything (Experiment Tracking)

As you know, a machine learning model is not a piece of code you write explicitly. It’s a unique piece of code derived from data, and it is conditioned on many variables with very different origins.

This diversity, of course, stems from the training data (read more about this issue in the section on getting prepared for data changes below). But it also grows out of the training procedure itself, which is defined by the training algorithm and its hyperparameters. And because some of these procedures are stochastic by nature, even the seed of your favorite random number generator can influence the final machine learning model. To an experienced software developer, it might sound crazy to rely on a learning procedure that never outputs the same result twice. It’s sad but true: consistently achieving the same results with machine learning procedures is a challenge for everybody in the field.

When so many variables can change, the best option is to track every change in a systematic way. If you plan to take your machine learning model to production and keep it there, you’d better rely on a consistent experiment-tracking platform.

So you might think that Git-like version control software is the silver bullet for tracking all of these changes. How wrong could that be? The thing is, with code versioning alone you can’t tell what actually changed. How many times have you been in a situation where your notebook suddenly doesn’t behave the same way it did six months ago? Thinking about what might have changed, you’ll start to question what went wrong:

  • Is it because some of your favorite packages have been upgraded?
  • Did you forget to fix the random state of any stochastic procedures?
  • Did you simply fail to save some of the required environment variables?

There are so many ways to fail to reproduce a machine learning model in the wild.

The best advice here is to rely on a dedicated platform to track all these external changes. This is even more true when you consider that inevitable team turnover will at some point force you to maintain someone else’s machine learning models. So if you want to escape from a future full of painful model maintenance issues, use tools to ensure model reproducibility from the very start of your journey.
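To make this concrete, here is a minimal sketch of the kind of run record worth capturing for every training run, whether by hand or through a dedicated tracking tool. The log_run helper, the file name, and the hyperparameter values are all hypothetical and only illustrate the idea.

```python
import json
import platform
import random
import sys
from datetime import datetime, timezone

import numpy as np

SEED = 42

# Fix the random state of every stochastic procedure you control.
random.seed(SEED)
np.random.seed(SEED)

def log_run(params: dict, metrics: dict, path: str = "run_record.json") -> None:
    """Record the context needed to reproduce a training run (hypothetical helper)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "seed": SEED,
        "hyperparameters": params,              # e.g. learning rate, tree depth, ...
        "metrics": metrics,                     # e.g. validation AUC, RMSE, ...
        "packages": {"numpy": np.__version__},  # extend with every library you rely on
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)

# After training and validating your model:
log_run(params={"max_depth": 6, "learning_rate": 0.1}, metrics={"val_auc": 0.87})
```

A dedicated tracking platform automates exactly this kind of bookkeeping, but even a hand-rolled record like the one above beats relying on memory six months later.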

You better get prepared for data changes (Monitoring & Alerting)

During the modeling phase of a machine learning project, we as data practitioners typically don’t pay much attention to data changes, because during this very creative phase the objective is to experiment with different ways to represent data. We try daring feature engineering techniques, timorous data aggregation, or bold data enrichment, consistently comparing results to see whether any of it improves the model. We won’t talk here about the ideal recipe for establishing a consistent validation procedure (this too could fill an encyclopedia volume of its own). Instead, we’ll focus on what could, and certainly will, happen once your model validation job is done and you get it into production.

Here the million- (or billion-) dollar question is not whether the data will go wrong. Instead it’s: how do I react when the data does go wrong? Because, without a doubt, the data will go wrong, I assure you!

There are so many paths to failure that we propose categorizing them into two buckets, depending on whether the problem originates between the keyboard and the chair or out in the real world.

The first category of failure? Cumbersome data issues found in production: badly formatted data, out-of-range data, or simply missing data. What they often have in common is that the root cause stems from an external “improvement” to the company’s IT system. To illustrate, a classic and annoying example is when the IT guy next door decides to rename some data features. As a human being, you can probably accept a passengerId variable being renamed to passenger_id as an improvement, and implicitly adapt to it. But your machine learning model won’t be so flexible, and it might silently fall back to treating the column as missing data. Another classic example from the trenches is an implicit change of units or formats in the data. When your model was trained with a U.S. datetime format and is then used in production with a European one, months turn into days and vice versa: guaranteed chaos. Unless your model fails silently (and you never notice), every time you encounter these kinds of issues you should be able to put your finger on a particular decision someone made that cascaded down to your fragile model in production.
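As an illustration of how to catch this first category of issues before they silently reach the model, here is a minimal sketch of a batch sanity check. The column names and expected ranges are hypothetical and borrow the passengerId rename example from above.

```python
import pandas as pd

# Columns and ranges the model was trained on (hypothetical example schema).
EXPECTED_COLUMNS = {"passengerId", "age", "fare"}
EXPECTED_RANGES = {"age": (0, 120), "fare": (0.0, 10_000.0)}

def check_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable issues instead of failing silently."""
    issues = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        # Catches the classic rename, e.g. passengerId -> passenger_id.
        issues.append(f"missing columns: {sorted(missing)}")
    for col, (low, high) in EXPECTED_RANGES.items():
        if col in df.columns and not df[col].between(low, high).all():
            issues.append(f"out-of-range values in '{col}'")
    for col in EXPECTED_COLUMNS & set(df.columns):
        if df[col].isna().any():
            issues.append(f"missing values in '{col}'")
    return issues

batch = pd.DataFrame({"passenger_id": [1, 2], "age": [34, 250], "fare": [12.5, 7.0]})
print(check_batch(batch))  # flags the renamed ID column and the impossible age
```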

The second category, data distribution changes from the real world, aka data drift, is a bit more interesting to cope with as a data science practitioner. (We elaborate on the concept of drift in the feedback-loop section below.) Data drift is the natural evolution of a feature’s distribution over time. For example, the average age in a fixed population increases as that population gets older. If a model was trained a long time ago on a younger population with age as a feature, it might have difficulties making accurate predictions for today’s older population. Drift is generally a good signal that it’s time to consider triggering a model retraining procedure.
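As a rough sketch of how drift on a single numeric feature can be flagged, here is one possible check using a two-sample Kolmogorov-Smirnov test (one of several reasonable choices). The data is simulated to mimic the aging-population example, and the alert threshold is an assumption you would tune to your own use case.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values seen at training time vs. a recent production window (simulated here).
train_age = rng.normal(loc=35, scale=8, size=5_000)
prod_age = rng.normal(loc=41, scale=8, size=5_000)  # the population got older

statistic, p_value = ks_2samp(train_age, prod_age)
if p_value < 0.01:
    # In practice this would raise an alert and possibly trigger retraining.
    print(f"Drift detected on 'age' (KS statistic={statistic:.3f})")
```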

There’s nothing that can protect you from these data-dependency failures because of the very specific nature of machine learning models. But if you’re aware of them, you can monitor them and have alerting mechanisms in place. That way, as soon as issues happen in production, you can quickly implement mitigation or fallback techniques.

Models get old — keep an eye on them (Feedback Loop & A/B Testing)

Most of the time, a supervised machine learning procedure is not an online learning procedure. Even if you’re very proud of a model’s accuracy at the time you release it into production, it’s very likely to get worse over time unless you refresh your model periodically.

So how do you detect things going wrong? The answer is simple, but the implementation is complex. To mitigate the risk of model decay, you simply need to check your model’s performance on a recurring basis. In data science jargon, you need to compare actuals (real-world ground truth) with predictions (model outputs). The tricky part is that in most cases, the actuals only become known a long time after the prediction has been made.

For instance, let’s look at the extreme but common situation of a credit-scoring model that outputs the probability that a given loan will be repaid. Clearly, in this situation you’ll only get the ground truth when the loan is either fully repaid or not, and that might happen years after you’ve granted it. Even worse, you’ll only get the ground truth for the loans you’ve actually granted. In this situation, you need to establish A/B testing strategies. A/B testing can help ensure that some not-so-bad-but-not-so-good loans are tested in the wild, and that you have a persistent feedback loop to collect this information long after the initial prediction was made.
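Here is a minimal sketch of what that feedback loop looks like once some ground truth has matured: join the logged predictions with the actuals that have arrived so far and recompute a live metric. The table and column names are hypothetical.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Predictions logged at scoring time, and actuals that arrive much later (hypothetical tables).
predictions = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "predicted_default_proba": [0.05, 0.40, 0.70, 0.15],
    "scored_at": pd.to_datetime(["2020-01-10"] * 4),
})
actuals = pd.DataFrame({
    "loan_id": [1, 2, 3],   # loan 4 has no outcome yet
    "defaulted": [0, 1, 1],
})

# Only evaluate the predictions whose ground truth has finally arrived.
matured = predictions.merge(actuals, on="loan_id", how="inner")
auc = roc_auc_score(matured["defaulted"], matured["predicted_default_proba"])
print(f"Live AUC on matured loans: {auc:.2f}")  # compare against the offline validation AUC
```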

Feedback loops and A/B testing strategies are the way to go to mitigate prediction breakdown. A vivid example of what happens without them is the sepsis model from the electronic medical records vendor EPIC, which was widely used across the health industry for two years before an independent study found that the model was flawed in the real world. This could have been detected and remedied much sooner if solid feedback loops had been put in place.

Data preparation and “the kitchen sink” (Pipeline & Scheduler)

Before feeding a machine learning model with data, we all know that the data must be prepared. From data cleaning to data enrichment to data conversion, a lot of our daily effort resides in the data transformation that we need to put in place before a model can predict anything. In fact, a common joke among data science practitioners is that 90 percent of our job is about data preparation.

We won’t argue whether this figure is perfectly accurate, but what remains is the absolute need for a machine learning model to be shipped alongside its own data transformations. That’s why you should, very early in your modeling process, distinguish what belongs to the model itself from what belongs to the data preparation.

Once you’ve separated the model part and the data preparation part, you’ll need to rely on a pipeline infrastructure to execute the data transformation ahead of model prediction in a consistent manner. In addition to a decent pipeline infrastructure, you’ll also need a scheduler for your peace of mind. No one wants to resurrect an ugly notebook every morning to get the daily predictions up and running when the model is in production.

That’s why successful data science practitioners design pipelines and use scheduling capabilities to get the job done in an automated way. So forget about the data preparation “kitchen sink” notebooks — and go for a clean and consistent pipeline to host your model and transformation together in production.
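As a minimal sketch of what hosting your model and transformation together can look like, here is one way to bundle preprocessing and a model into a single artifact with scikit-learn. The feature names are hypothetical, and the fit and serialization steps are shown as comments since they depend on your own data.

```python
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature lists; adapt to your own dataset.
numeric_features = ["age", "fare"]
categorical_features = ["embarked"]

preprocessing = ColumnTransformer([
    ("numeric", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric_features),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Preprocessing and model travel together as one artifact.
model = Pipeline([
    ("prepare", preprocessing),
    ("classify", LogisticRegression(max_iter=1000)),
])

# model.fit(X_train, y_train)
# joblib.dump(model, "model.joblib")  # the scheduler's batch job loads this single file
```

With the transformations baked into the same object as the model, the scheduled batch job only has to load one artifact and call predict, instead of reassembling a notebook’s worth of preparation steps every morning.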

What makes a great machine learning platform for production

If you’ve successfully put a model into production, you know that the modeling is just the tip of the iceberg. Indeed, modeling is the most visible part of our job — the part your company cares about because it delivers real, tangible value.

But you know that to support your precious model in production, a lot of work still has to be done. We’ve detailed the need for experiment tracking, monitoring and alerting, feedback loops and A/B testing, and pipelines and scheduling.

And we have good news for you. At Prevision.io, we’ve been through this journey many times with our customers, and we’re dedicated to building the most compelling AI management platform. Our platform lets you deploy and maintain models in production, so that you can focus on what really matters to data science practitioners: collecting the data, defining the objective, building the model, and letting the platform serve it for you.

Do what you do best. Apply your talents to solving business problems while we take care of the technical grunt work of building, deploying, and monitoring your high-performing models. If you’d like to try it for yourself, please visit us at www.prevision.io to learn more.

Written by Nicolas Gaude, Co-founder at Prevision.io
