Why You Can’t Get Your ML Models into Production

David Hershey
Nov 2, 2020 · 9 min read


Many teams are running into the same problem with Machine Learning: they build ML prototypes that promise to revolutionize the way they do business, but hit roadblock after roadblock trying to move those models into production. In many places, this causes ML to lose some of its luster — ML promising to save you $1,000,000 sounds great until it costs you $1,000,000 to get a model into production. What is going on at these organizations, and how can we fix it?

Science, Prototyping and Production — a case study

A large part of what is still broken for most ML shops is a lack of process for moving from pure research to prototyping to production. In non-software industries, this process is well developed and understood — to better understand the difference let’s consider the problem of battery development for electric vehicles.

Science: Pure science happens in both academia and industry, working on everything from chemistry to mechanical design. These breakthroughs will eventually lead to batteries that have a higher energy density or lower weight, and your car company will want to test out this new technology to see if it can extend the range of your vehicles.

Prototyping: Prototyping is the key step transforming new ideas into business value. For something like a battery, potentially years of development may occur before the technology is ready for production (luckily, software like ML tends to be faster). The prototyping phase starts with extensive analysis to ensure it is feasible for the new technology to make its way into your product. You’ll build prototypes, iterate on them, and modify the form-factor to more closely resemble production. At some point you’ll decide the new technology is worthwhile, and you’ll begin planning how to integrate the new battery into a real-world vehicle. You’ll outfit your factories with new tooling, and design your next car with the new battery in mind.

Production: Once all of this groundwork has been laid, you’ll start building cars with the new battery. You’ve labored over the details in prototyping and spent plenty of time designing the tools to integrate into your new car. You’ll hit road bumps, but you’ll have the right processes in place to address those as you go.

What Prototyping Looks Like in ML Today

In machine learning today, all of these same phases exist, but the links between them end up looking pretty broken. Let’s step through it:

Science: In both academia and industry, pure research on ML is advancing more rapidly than ever. New model architectures are designed, new loss functions improve performance of some models, and new algorithms for training improve how accessible those models are to downstream consumers. Just like in the battery example, ML research is often conducted without a production use case in mind — there’s no need to worry if your research code can scale since your main goal is to learn new information about ML.

The Problem: none — until you get to the next step!

Prototyping: In ML, research breakthroughs seem incredibly accessible compared to other fields of science and engineering. Research code is often published and made available for free, meaning users in industry are empowered to download a new model, plug in their dataset, and start to see results right away. In this way, prototyping is often vastly accelerated compared to hardware development, and results can be seen and promised much more quickly. Even then, just reproducing research results can take weeks and lots of GPU hours — deep learning is still not easy.

The Problem: you’ve been lulled into a false sense of security, because you just copied research code instead of building from the ground up with a path to production in mind!

Production: In many early-stage ML teams, the first pass at production takes the form of trying to morph your prototype into a production system! Even worse, that prototype was often copied from pure research code that was never intended to be used as part of a product! Now you run into a litany of issues — code that isn’t performant, doesn’t run on production hardware, a lack of clarity about how to get data to your model in production, and ad-hoc processes for each new model you generate.

The Problem(s): (1) Your prototyping process isn’t done with production in mind and (2) you didn’t spend the time to build a factory before you tried to sell your new car!

If you’ve found yourself facing these problems, you’re not alone. ML is evolving faster than nearly any other field with as large and broad an impact on the world — very few have figured out how to do this at scale.

The Solution: A Sustainable Research to Production Process

The solution to lower the cost of moving ML models into production is twofold: (1) process and (2) tooling.

Model Development Process

If you want to build models that will transition smoothly to production, you need to start the prototyping process with production in mind. You’re not just building a pedestrian detection model — you’re building a pedestrian detection model that needs to run on-vehicle. You’re not just building a ride-share pricing model — you’re building a ride-share pricing model that needs to respond within 100 ms.

What does that mean? Some thoughts:

  • Be careful when adapting and reusing model code, particularly research code. Make sure that you either find an implementation that will support your production use case, or factor in time to re-implement your model in production.
  • Consider establishing a standard format for all of your ML models so you can build generic tooling around them.
  • You’ll need to take excellent notes while you prototype — keep track of every model iteration, hyperparameter configuration, metric, artifact, and dataset version. More on this in the tooling section.
  • Remember that you need to keep track of all of the data preprocessing you do — you’ll need to recreate those exact transformations to run in production.
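The note-taking the checklist above calls for doesn't need fancy infrastructure to start. Here is a minimal, illustrative sketch (function and field names are hypothetical, not from any particular tool) of logging one record per training run, capturing the hyperparameters, metrics, dataset version, and artifacts you'll need when a model is promoted to production:

```python
import json
import time
from pathlib import Path

def log_run(run_dir: str, hyperparams: dict, metrics: dict,
            dataset_version: str, artifacts: list) -> Path:
    """Append-only experiment record: one JSON file per training run."""
    record = {
        "timestamp": time.time(),
        "hyperparams": hyperparams,          # every configuration you tried
        "metrics": metrics,                  # results of this iteration
        "dataset_version": dataset_version,  # which data produced the model
        "artifacts": artifacts,              # checkpoints, exported models, etc.
    }
    path = Path(run_dir)
    path.mkdir(parents=True, exist_ok=True)
    out = path / f"run_{int(record['timestamp'] * 1000)}.json"
    out.write_text(json.dumps(record, indent=2))
    return out

# Usage: call once per run, then grep/compare the JSON files later.
log_run("runs", {"lr": 3e-4, "batch_size": 32},
        {"val_accuracy": 0.91}, "v2024-01", ["model.ckpt"])
```

The experiment-tracking tools discussed later automate exactly this bookkeeping, but writing the record out by hand makes it clear what information a production handoff actually requires.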

Supporting Tools

Asking you to juggle all of the above processes by hand is too difficult — you’ll get bogged down keeping track of your work and never actually make progress developing your model. Luckily, a variety of tools exist to complement your existing ML workflows and help you build models that are ready for production.

Production-ready Code: Ensuring your code is “production-ready” depends on what production looks like for you, but there are some constants that will make life easier for you:

  • Expose crucial sections of your code as standard APIs. Things like creating your model, loading pretrained weights, and conducting a forward pass should all be standard across models to ensure the code is easy to re-use in production.
  • If your production environment requires a programming language that isn’t Python, have a plan from the beginning. Whether that is converting your model to ONNX or using native APIs for PyTorch, TensorFlow, or XGBoost, you’ll want to make sure whatever model you use has a path to production.

Some great tools exist that make it easy to standardize your ML code, and they provide real benefits for doing so. I work at Determined AI, so I’m a little biased, but Determined is one of the premier tools for productionizing training code. Determined is an open source platform that has a standardized ML code format — using this format will unlock simple distributed training, training on the cloud, state-of-the-art hyperparameter tuning, and automatic experiment tracking. Further, Determined offers a model registry, which provides you with a place to store your models for consumption in production — one of the easiest ways to guarantee an easy transition from research to production.

For a lighter-weight solution, PyTorch Lightning is an easy tool to provide you with clear model structure (if you’re using PyTorch). Although it doesn’t offer as many features as Determined, it will provide you with some simple training APIs so you don’t need to write as much boilerplate code (training loops, checkpointing, etc.).

Tracking Your Experiments: Tracking experiments isn’t just a best practice to help you manage your own work (although it does help with that!), it’s an essential part of the process of moving models to production. When it comes time to promote a trained model to production, you’ll want a clean log of exactly what model you trained (code version), what data that model was trained on (data version), the results of model training (metric tracking), as well as the artifacts associated with that model (artifact management). You should either have a plan to track all of these items for each experiment you run, or you should use a tool to help you out.

There are multiple good options in the experiment tracking space. Determined is extremely full-featured in this space, as it tracks all of the items you’ll need for production automatically — metrics, code version, and artifacts are all managed for you. Further, Determined then exposes all of these metrics with a model registry, making it simple to use those models in production.

MLflow is perhaps the most popular tool in the experiment tracking space. Also open source, MLflow provides a convenient set of APIs that allow you to outfit your code to track parameters, metrics, artifacts, and code version for experiments. MLflow has great tools for experimentation — allowing you to easily compare metrics across multiple experiments, which is useful as you develop a model. MLflow also provides a model registry for tracking artifacts for production, although managing the artifacts for these models is slightly more manual than with Determined.

Weights and Biases is another popular experiment tracking framework that operates similarly to MLflow, providing you with APIs so that you can outfit your code with the ability to track parameters, metrics, artifacts, and code version for experiments. They provide excellent dashboards to visualize experiments, compare multiple experiments, and manage workflows. One caveat (if relevant to you) is that only the Weights and Biases client is open source; the actual dashboards for tracking are not. That said, you can sign up for a free individual account, which should work for most single users.

Managing your Data Preprocessing: One section of version management deserves special attention — transformations on source data. Often overlooked, keeping track of all of the transformations done on raw data to prepare it for modeling is crucial. What this actually looks like depends highly on what type of data you’re working with — for the sake of this post we’ll break this into two categories, structured and unstructured data.
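One simple way to make transformations reproducible is to express preprocessing as data rather than ad-hoc script code: a declarative list of steps that is serialized with the trained model and replayed verbatim in production. This is an illustrative sketch under that assumption; the step names and parameters are hypothetical:

```python
import json

# Preprocessing described as data, so the exact transformations that fed
# training can be stored next to the model and replayed at serving time.
STEPS = [
    {"op": "fill_missing", "value": 0.0},
    {"op": "clip", "low": 0.0, "high": 100.0},
    {"op": "scale", "mean": 50.0, "std": 10.0},
]

def apply_steps(values, steps):
    """Replay a serialized preprocessing recipe over a list of raw values."""
    for step in steps:
        if step["op"] == "fill_missing":
            values = [step["value"] if v is None else v for v in values]
        elif step["op"] == "clip":
            values = [min(max(v, step["low"]), step["high"]) for v in values]
        elif step["op"] == "scale":
            values = [(v - step["mean"]) / step["std"] for v in values]
    return values

# Ship the serialized recipe alongside the model artifact; production
# deserializes it and applies exactly the transformations training used.
serialized = json.dumps(STEPS)
print(apply_steps([None, 120.0, 55.0], json.loads(serialized)))  # [-5.0, 5.0, 0.5]
```

The point is not this particular mini-interpreter but the discipline: if the recipe is an artifact, training/serving skew from "someone tweaked the SQL" becomes much harder to introduce.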

Structured data can actually lead to some of the most unfortunate gaps between research and production. If you’re not careful, all of the data preparation you do on structured data (SQL queries, cleaning, etc.) can be really hard to reproduce in production. There exists a whole genre of tools to address this problem, called Feature Stores. If you’re doing a lot of ML with structured data, a feature store might well be a great solution for you. There are a handful of solutions out there, including Tecton, Hopsworks, Feast, and Scribble.

Unstructured data faces a different set of problems. More of the preprocessing for unstructured data (images, text, audio) is done in Python and therefore tends to get captured by version control, making it slightly easier to recreate in production. In this space the main concerns are around versioning your data (with tools like Pachyderm or DVC) and making sure you know which set of data transformations was used for each trained model — which looks more like an experiment tracking problem.
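The core idea behind data versioning tools like DVC is content addressing: derive a version string from the bytes of the dataset itself, so any change to the data changes the version. A minimal stdlib-only sketch of that idea (not DVC's actual implementation) looks like this:

```python
import hashlib
from pathlib import Path

def dataset_version(data_dir: str) -> str:
    """Content-addressed version string: hash every file in the dataset.

    Files are visited in sorted order so the version is deterministic;
    renaming, editing, adding, or removing any file changes the result.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(data_dir)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]
```

Recording this string with every trained model (as in the experiment record discussed earlier) is enough to answer "which data produced this model?" long after the run.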


When moving the latest research breakthroughs in ML to production, you might be tempted to skip a few steps. If you resist that urge by starting early, making a plan for production, and using the right tools, you can reduce that burden and consequently the amount of time you spend making models work in production.

Note that this is only one piece of the puzzle that is production ML; depending on your use case you’ll also need serving infrastructure, monitoring infrastructure, and more. This is a fundamental piece though — without a good process for prototyping you’ll get stuck, no matter what other tools you have.



David Hershey

Investor at Unusual Ventures | Machine Learning Infrastructure Enthusiast