The 10 Most Common Challenges Converting Machine Learning Models into Production

Forewarned is Forearmed

Eric Luellen
6 min read · Jul 16, 2021

Some practitioners observe that few machine-learning models make it off the data scientists’ workbench and scale into production. Here, production means seamlessly integrated code that automatically accepts input, runs, and produces output. For those who appreciate root-cause analysis, and want to understand why this happens so it doesn’t happen to them, there are several contributing factors (Gonfalonieri, Why is machine learning deployment hard, 2019). Moreover, even when ML models do get integrated and scaled into production environments, they require continuous maintenance, and in ways that run contrary to how many software production environments evolve. The following sections list and describe the 10 most common challenges in converting ML models from development into production.

Data Science > Software Engineering Transition

One, deploying an ML model into production at scale involves more software engineering and application development than many data scientists are comfortable with. Remember, data science is the confluence of software development, mathematics/statistics, and subject-matter expertise. This means data scientists have differing levels of knowledge in software engineering, ranging from none to moderate. Conversely, software engineers who develop applications and production architectures are not inherently data scientists. They often lack the knowledge to understand how ML applications work and what idiosyncrasies they have. As ML models increase in complexity via ensembles and automated ML, the gap between the science and the engineering becomes broader and more challenging to bridge.

Programming Language Inconsistencies

Two, there’s a disconnect between the programming languages and tools of data science and those of production environments. Arguably, the two most common data-science languages are R and Python (there are also SAS, Java, Julia, Scala, and a whole family of probabilistic programming languages (PPLs) such as Edward, Gen, Infer.NET (Microsoft), and Pyro) (Rodriguez, 2019). Because R is slow at processing big data and is mission-specific, it rarely belongs in a scaled production environment. When Python or R models (or models in other languages, for that matter) are ported to whatever language(s) the legacy production environment uses (e.g., Java, C++, etc.), the process can be error-prone, slow, and tedious (Gonfalonieri, 2019).
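
One common way to sidestep hand-porting is to export a trained model to a portable format such as ONNX, which runtimes in Java, C++, or C# can then load. Below is a minimal sketch, assuming scikit-learn and the skl2onnx converter; the data set, feature count, and file name are illustrative.

```python
# A minimal sketch: exporting a scikit-learn model to ONNX so a
# non-Python production runtime can serve it. Assumes scikit-learn
# and skl2onnx; the 4-feature input shape is illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Declare the input signature (None = variable batch size, 4 features).
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 4]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```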

Version Controls

Three, most ML models are built with libraries or packages, which evolve through different versions. This can mean the models only work as designed with the library or package versions in which they were developed, tested, and proven. Tracking these version dependencies through a production stack, and updating them when needed, is a challenge. Moreover, there may be a mathematical reason to use multiple versions of the same library for different models, making them harder to reproduce in production because of the increased complexity of version dependencies (Gonfalonieri, 2019).
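
A lightweight defense is to record the exact library versions alongside each serialized model, so production can verify it is loading the model into the environment it was trained in. A minimal sketch, assuming joblib and scikit-learn; the metadata file format is an illustrative convention, not a standard.

```python
# A minimal sketch: persisting a model together with the library
# versions it was trained under. Assumes joblib and scikit-learn;
# the ".meta.json" sidecar format is an illustrative convention.
import json
import sys

import joblib
import sklearn

def save_model_with_versions(model, path):
    joblib.dump(model, path)
    meta = {
        "python": sys.version.split()[0],
        "scikit-learn": sklearn.__version__,
        "joblib": joblib.__version__,
    }
    with open(path + ".meta.json", "w") as f:
        json.dump(meta, f, indent=2)

def check_versions(path):
    # Fail loudly if production's library versions differ from training's.
    with open(path + ".meta.json") as f:
        meta = json.load(f)
    if meta["scikit-learn"] != sklearn.__version__:
        raise RuntimeError(
            f"Model trained on scikit-learn {meta['scikit-learn']}, "
            f"but this environment runs {sklearn.__version__}"
        )
```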

Inadequate Processing Capacity

Four, ML models are processor-intensive in scaled production environments, which is also expressed as ‘computationally expensive.’ In other words, they use a large amount of computing power and often use graphics processing units (GPUs) in parallel with central processing units (CPUs). The challenge with production at scale is that many production environments lack this processing capacity (and GPUs). Therefore, implementing ML models at scale usually requires a complete rethink and redesign of the production environment in terms of server and processing capacity (and the associated increased costs) (Gonfalonieri, 2019).
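
In practice, inference code often has to probe for an accelerator and fall back to the CPU when the production hardware lacks one. A minimal sketch, assuming PyTorch; the model and batch shapes are placeholders.

```python
# A minimal sketch: choosing a GPU when available, falling back to CPU.
# Assumes PyTorch; the model and batch are illustrative placeholders.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(128, 10).to(device)  # placeholder model
batch = torch.randn(32, 128, device=device)  # placeholder batch

with torch.no_grad():
    predictions = model(batch)
print(f"Ran inference on {device}")
```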

Scalability Challenges

Five, ML models moving from development to production environments often encounter scalability issues. ML models are typically developed and tested in environments that are smaller and slower than scaled production (Gonfalonieri, 2019). Therefore, they may not be ready for, or even work on, bigger and/or faster-moving data sets. Not only can this create issues with accuracy or precision (for example, over-fitting on a bigger data set that didn’t occur on a moderate one), it often impacts performance. The impact of 100x or 1,000x more data on the processing performance of a neural network or deep-learning network, for example, is super-linear, not linear. Moreover, if more data must be processed at a higher velocity, the resulting longer processing and compute times are doubly problematic.
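
One mitigation is to stream data through the model in bounded batches rather than loading an entire production-scale data set into memory. A minimal sketch in plain Python; the batch size and the model’s predict method are illustrative assumptions.

```python
# A minimal sketch: bounded-memory batch inference over a large stream.
# The model.predict() call and batch size are illustrative placeholders.
from itertools import islice

def batched(iterable, batch_size=1024):
    """Yield successive fixed-size batches from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

def score_stream(model, records):
    # Memory use is bounded by batch_size, not by the size of records.
    for batch in batched(records):
        yield from model.predict(batch)  # hypothetical model API
```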

Decentralized Data & Definitions

Six, applying ML models in scaled production is challenging because of data silos. Reporting from data warehouses greatly improved access to holistic data in the 1990s; however, data silos are still alive and well. Data silos occur at functional department or division levels (e.g., sales, customer service, imaging, billing), geographic levels (e.g., operations in different countries), time levels (e.g., recent data vs. legacy data), and because of system incompatibilities (e.g., legacy data left behind in a system incompatible with the current production system, or data recorded under different data dictionaries). Even if data repositories can be centralized or virtualized, data streams must be too.

Dynamic Inputs Break Machine Learning

Seven, data inputs are always evolving in production environments, both in the number and granularity of the fields captured and in the all-important definition of each field (Gonfalonieri, 2019). This is problematic for ML models because they are carefully developed and tested against a fixed set of fields with specific definitions. When data inputs change, the models need to be re-tested, often re-built, and re-validated. This develop-train-test cycle can last from days to months every time a data input changes.

Conversely, production environments at scale cannot simply evolve their data inputs or definitions and continue to apply the older ML models. Because of the complexity of neural networks, deep learning, ensembles, and automated ML, it is not humanly feasible to predict what will happen when a data input changes. A system needs to be put into place that tracks all data input changes and the associated changes in ML models (e.g., model type(s), weights, etc.); a schema check of the kind sketched below is a common first line of defense.
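
A minimal sketch of such a guard in plain Python: the expected schema is recorded when the model is trained, and incoming records are rejected when fields appear, disappear, or change type. The field names and types here are illustrative.

```python
# A minimal sketch: rejecting inference requests whose fields no longer
# match the schema the model was trained on. Fields are illustrative.
EXPECTED_SCHEMA = {"age": int, "income": float, "region": str}

def validate_record(record: dict) -> None:
    missing = EXPECTED_SCHEMA.keys() - record.keys()
    unexpected = record.keys() - EXPECTED_SCHEMA.keys()
    if missing or unexpected:
        raise ValueError(
            f"Schema drift: missing={sorted(missing)}, "
            f"unexpected={sorted(unexpected)}"
        )
    for field, expected_type in EXPECTED_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(
                f"Field {field!r} expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
```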

New Quality Assurance Types

Eight, ML models require a separate and more complex quality assurance (QA) process than is in place in most organizations’ production environments (Gonfalonieri, 2019). Traditional QA, oversimplified, tests whether the code runs and the outputs are correct. In ML, however, it’s difficult to know whether the outputs are correct because they are complex and predictive. ML model development relies on specialized statistical testing that is not used in traditional software QA.
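
One concrete form this takes is a statistical acceptance test: rather than asserting exact outputs, the test asserts that a model’s accuracy on a held-out set stays above an agreed threshold. A minimal sketch in pytest style, assuming scikit-learn; the threshold, data set, and split are illustrative.

```python
# A minimal sketch: a statistical QA gate instead of an exact-output
# assertion. Assumes scikit-learn; threshold and split are illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def test_model_meets_accuracy_floor():
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    assert accuracy >= 0.90, f"Accuracy {accuracy:.3f} below QA floor"
```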

Degrading ML Models Require Continuous Maintenance

Nine, arguably one of the two dirty secrets of machine learning (the first being inexplicability) is that models begin degrading almost as soon as they’re put into use and therefore require continuous revision. This surprises many practitioners because they are habituated to traditional software applications, which keep working long after they’re built. In economic terms, the marginal cost of software is typically zero, which has been a profoundly important part of software’s business model and ROI for 35 years: build it once, use it for a long time, because there’s no additional cost every time it runs. That’s not true for machine learning. It has a marginal cost because the models degrade, sometimes within weeks of deployment after months spent developing and implementing them. In better cases, an ML model’s shelf-life of accuracy is several months. And while the academic origins of machine learning are conscious of this problem and have studied it, it’s often unknown, forgotten, or neglected in commercial production environments, to the owners’ detriment. Even when data-science practitioners are aware of the possibility of degradation, they often don’t know, and can’t predict, how fast it will occur (Talby, Why machine learning models crash and burn in production, 2019).

ML model degradation comes in two forms: concept drift and generalization failure. Concept drift occurs because the data inputs to ML models are dynamic in the real world; the data changes. This matters because ML models are trained and tested on one set of data; the more their data inputs change over time, the less accurate the models become. For example, weather, time of year or seasonality, and evolving consumer thinking or tastes change what consumers desire, when, how much, and why. Therefore, the causal elements that predict whether a consumer will behave a certain way evolve. The more they evolve, usually as a function of time, the less accurate older ML models become, because there is more virtual distance between their training data and what’s occurring in the real world. Simply put, the ML models are trying to make predictions from a picture of a real world that no longer exists. Generalization, by contrast, is closely linked to over-fitting: it refers to an ML model’s ability to adapt to new data on which it was neither trained nor tested (Gonfalonieri, 2019).
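
Concept drift can be monitored statistically, for instance by comparing the distribution of a live feature against its training-time distribution with a two-sample Kolmogorov-Smirnov test. A minimal sketch, assuming SciPy and NumPy; the significance level and sample arrays are illustrative.

```python
# A minimal sketch: flagging concept drift on one numeric feature by
# comparing training vs. live distributions with a two-sample KS test.
# Assumes SciPy/NumPy; alpha and the sample arrays are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def feature_has_drifted(train_values, live_values, alpha=0.01):
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha  # reject "same distribution" at level alpha

# Illustrative usage: the live data's mean has shifted since training.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)
live = rng.normal(loc=0.4, scale=1.0, size=5000)
print(feature_has_drifted(train, live))  # True: drift detected
```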

Accuracy Validation is Difficult

Ten, validating the accuracy of ML models in production is notoriously difficult. Inaccurate ML models have caused reputational damage by making racist or sexist recommendations. They can also misinform decisions that cost their users millions of dollars in direct costs or in the opportunity costs of lost profits or revenues (Kohavi, 2012). Knowing the most common failure modes, the tradeoffs between accuracy and precision, and how to prevent them is a team-sized specialization in and of itself.
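
The accuracy/precision tradeoff can be made concrete by reporting several metrics side by side instead of a single accuracy number: on an imbalanced data set, a model can score 90% accuracy while missing half the positive cases. A minimal sketch, assuming scikit-learn; the labels are illustrative.

```python
# A minimal sketch: accuracy alone can hide a poor precision/recall
# balance, so report the metrics together. Assumes scikit-learn;
# the true/predicted labels below are illustrative.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.90
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # 1.00
print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # 0.50
```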

Eric Luellen

Eric Luellen, an accomplished businessman, scientist, and serial entrepreneur, is CTIO at Turing Biosciences. He is a father of three and lives near Boston.