DIY Fail: Taking on Model Deployment Alone

Many of us who work in software know the panic on the other end of that 3am wake-up call: a production system is down. If you don’t know that feeling, just ask your DevOps or IT colleagues and watch them sweat. When different micro-services work together to deliver a product, things break. This contrasts sharply with how we Data Scientists approach our modeling work.

Data exploration and model development are often conducted in an environment that is very different from “production”. Small samples of data can often reside on a laptop, with no need for extra computing resources. Even when data is accessed remotely, exploratory analysis and hypothesis testing rarely require someone on “pager-duty” 24/7. Many of the most challenging problems do not arise until a data science team tries to deploy a machine learning pipeline into production. At that point, progress slows, reliance on software and IT teams increases, and economic value stagnates.

Businesses expect more and more of their data science teams. As those expectations mature, data science teams need to provide more than ad-hoc modeling and analysis. Companies are beginning to realize that they need robust, maintainable, automated deployment pipelines with up-times that rival operational data stores. What is the pathway from proof-of-concept to a production pipeline that covers data ingestion, cleaning, feature engineering, model training, and deployment? Many companies are currently addressing this challenge head-on, and are asking themselves: should I build it or should I buy it?

If you’re a Data Scientist-Engineer hybrid like me, you might jump at the opportunity to rally the team and build an “in house” solution. But deploying machine learning solutions to a stable, scalable, and enterprise-grade production environment is hard. You will quickly discover some difficult truths:

1. Local machine learning is very different from machine learning in production.

A model prototyped in a local environment has a different set of dependencies when it gets to production. Whether the production environment is hosted in the cloud or on a cluster of physical servers sitting in a chilled closet down the hall, some new questions must be considered:

  • Data Storage & Scale: Where is the data coming from and going to? How does the model get access to it? Are there enough compute resources to go around?
  • Scheduling: How often do the jobs need to run? When does the model need to train or score? How are the processes dependent on one another?
  • Output: The whole reason a model goes into production is to drive business insights and decisions. What application, software, report, HTML page, or mobile app consumes the output of the model(s)? How is it going to get there?
  • Backend Infrastructure: Now that the model isn’t an ad hoc experiment anymore, what infrastructure is in place to make sure that it’s resilient to network issues, node failures, or corrupted data input?
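
To make the scheduling and resilience questions concrete, here is a minimal sketch of how stage dependencies and retries might be expressed. The stage names and the retry count are hypothetical, chosen only for illustration; a real deployment would typically hand this job to a dedicated scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline stages. Each stage maps to the stages it depends on.
PIPELINE = {
    "ingest": set(),
    "clean": {"ingest"},
    "engineer_features": {"clean"},
    "train": {"engineer_features"},
    "score": {"train"},
    "publish": {"score"},
}


def execution_order(pipeline):
    """Return stage names in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def run_with_retries(task, max_attempts=3):
    """Call task(); on an exception, retry until max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the failure to whoever is on call
```

Calling execution_order(PIPELINE) yields the stages from "ingest" through "publish", and wrapping each stage in run_with_retries absorbs transient failures such as a dropped network connection. Corrupted input still needs explicit validation, since retrying will not fix it.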

2. Maintenance costs dominate ML pipeline expenses in production.

Did you accidentally leave an instance or two running while you went on vacation? That could be a costly mistake! Many cloud services and deployment platforms are unforgiving in their variable pricing plans, and a lack of visibility into month-over-month expenses is a hard sell to business leaders in any organization.
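
A quick back-of-the-envelope calculation shows how fast a forgotten instance adds up. The hourly rate below is a made-up figure for illustration, not a quote from any provider.

```python
HOURS_PER_DAY = 24


def idle_cost(hourly_rate, instances, days):
    """Cost of leaving `instances` machines running around the clock for `days` days."""
    return hourly_rate * instances * HOURS_PER_DAY * days


# Assumed numbers: two instances at a hypothetical $3.00/hour,
# left on during a two-week vacation.
print(idle_cost(3.00, 2, 14))  # 2016.0
```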

Infrastructure costs and person-hours alone are huge maintenance expenses. According to a report published by McKinsey Global Institute (“Artificial Intelligence: The Next Digital Frontier?”, June 2017), large technology companies spent an estimated $20 billion to $30 billion on AI in 2016 alone, most of it on R&D and deployment. Keeping the data pipes clean for machine learning is a challenging job, requiring a full stack: data store, reliable queues, server-side communication, websocket connections, and a colony of clusters, all requiring configuration and orchestration. Contracts between these micro-services will be breached, nodes will fail, and those 3am alerts will start pouring in (if they haven’t already!).

3. Model performance degrades over time.

As the world changes around us, assumptions integrated into our machine learning models break. Especially in systems where a feedback loop is generated, data inputs may reflect a different world than they did when the model was first built. Sometimes a model breaks because an external data API is no longer available. Sometimes it fails because a source of bias crept its way into the training data. When a model is in production, it must be carefully monitored so that any harmful real-world consequences of its predictions are minimized. Undetected degradation is especially costly in the industrial space, where severe errors in predicting machinery failures can cost hundreds of thousands of dollars, if not millions over time.
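
One common way to watch for this kind of shift is to compare the distribution of a feature at training time against what the model sees in production. The sketch below uses the Population Stability Index (PSI) as an example technique; it is one option among many, not a method prescribed here. The bin count and the small floor that avoids taking the log of zero are assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time sample (expected) and a production sample (actual).

    Both inputs are non-empty sequences of numbers. Bin edges come from the
    expected sample; production values outside that range fall into the
    first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of zero means the two samples fill the bins identically. As a widely quoted rule of thumb, values above roughly 0.25 are treated as a significant shift worth investigating, though the right threshold depends on the feature and the business risk.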

4. A typical Data Scientist is not an expert in cloud infrastructure or backend engineering.

As a Data Scientist myself, I will be the first to admit that our breed doesn’t speak the same language as Software Engineers. The development lifecycle for model building also differs from building a scalable microservice. This is understandable, because there are fundamentally different tools, languages, concepts and environments used for making things work. However, when it comes to deploying end-to-end machine learning pipelines that deliver prescriptive insights, gaps in understanding and communication can cause severe delays in realizing value.

At Metis Machine, we’ve witnessed the horror stories of model monitoring and deployment with customers in many domains. Frequently, we encounter young and energetic organizations that have the optimism to take over the world. However, if they haven’t yet stepped into the world of ML deployment, they haven’t been battle-tested.

With Skafos, much of that friction goes away: our platform enables teams of data scientists to shorten time to market by providing tools and workflows that are familiar and easy. Serverless ML production deployment is as simple as “git push”. That time saved may be the difference-maker in a competitive tech industry.

We have the battle scars from wrestling with infrastructure, smoothing out the deployment pipeline, and moving through stages of monitoring maturity internally. Those insights go right into our product and into the hands of our user community, making Skafos a better platform for deploying and managing machine learning pipelines. So, before you decide to build it yourself, remember this: no one wants to be woken up at 3am.

By Tyler Hutcherson
