Productionizing Machine Learning Models
Deploying models into production is a complex affair
The biggest challenge in the life cycle of an ML project is not creating a good algorithm, generalizing the results, or getting good predictions and better accuracy. The biggest challenge is putting ML systems into production. A known truth of the machine learning world is that only a small part of a real-world ML system is composed of ML code; the much larger part is model deployment, model retraining, maintenance, ongoing updates and experiments, auditing, versioning, and monitoring. These steps account for a huge share of an ML system's technical debt, because they exist at the system/platform level rather than the code/development level. The model deployment strategy is therefore a crucial part of designing an ML platform.
Introduction
The first step in determining how to deploy a model is understanding the system by asking these questions:
- How does the end user interact with the model's predictions?
- How frequently should predictions be generated?
- Should predictions be generated for a single instance or a batch of instances at a time?
- How many applications will access the model?
- What are the latency requirements of these applications?
It’s indicative of the complexity of machine learning systems that many large technology companies that depend heavily on machine learning have dedicated teams and platforms that focus on building, training, deploying, and maintaining ML models. Here are some examples:
- Databricks has MLflow
- Google has TensorFlow Extended (TFX)
- Uber has Michelangelo
- Facebook has FBLearner Flow
- Microsoft has AI Lab
- Amazon has Amazon ML
- Airbnb has BigHead
- JPMC has Omni AI
Machine Learning System vs Traditional Software System
1. Unlike traditional software systems, deploying an ML system is not the same as deploying a trained ML model as a service. ML systems require a multi-step automated pipeline for retraining, validating, and deploying models, which adds complexity.
2. Testing an ML system involves model validation, model training, and so on, in addition to software tests such as unit testing and integration testing.
3. ML systems are much more dynamic in terms of performance, because data profiles vary and the model has to be retrained or refreshed often, which leads to more iterations in the pipeline. This is not the case with traditional software systems.
Model Portability (From Model Development to Production)
Writing code to predict or score data is most often done in Jupyter notebooks or an IDE. Taking this model development code to a production environment requires converting language-specific code into some exchange format (compressed and serialized) that is language-neutral and lightweight. Portability of the model is therefore a key requirement.
Below are the widely used formats for ML model portability:
1. Pickle — A pickle file is the binary serialization of a Python object, used for serializing and de-serializing a Python object structure. Converting a Python object hierarchy into a byte stream is called “pickling”; converting the byte stream back into an object hierarchy is called “unpickling”.
2. ONNX (Open Neural Network Exchange) — ONNX is an open-source format for machine learning models. ONNX has a common set of operators and file formats to use with models on a variety of frameworks and tools.
3. PMML (The Predictive Model Markup Language) — PMML is an XML-based predictive model interchange format. With PMML, you can develop a model in one application on one system and deploy it on another system with another application, simply by transmitting an XML configuration file.
4. PFA (Portable Format for Analytics) — PFA is an emerging standard for statistical models and data transformation engines. PFA is easily portable across different systems, and models, pre-processing, and post-processing functions can be chained and built into complex workflows. A PFA document can be a simple raw data transformation or a sophisticated suite of concurrent data mining models, expressed as a JSON or YAML configuration file.
5. NNEF (Neural Network Exchange Format) — NNEF is useful in reducing the pains in the machine learning deployment process by enabling a rich mix of neural network training tools for applications to be used across a range of devices and platforms.
There are some framework-specific formats as well, like — Spark MLWritable (Spark specific) and POJO / MOJO (H2O.ai specific).
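As a minimal sketch of the pickling/unpickling round trip described above, with a trivial hand-written class standing in for a real trained estimator:

```python
import pickle

class ThresholdModel:
    """Hypothetical stand-in for a trained model: predicts 1 above a learned threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, xs):
        return [1 if x > self.threshold else 0 for x in xs]

model = ThresholdModel(threshold=0.5)

# "Pickling": convert the Python object hierarchy into a byte stream.
blob = pickle.dumps(model)

# "Unpickling": rebuild the object from the byte stream,
# e.g. inside the serving process on another machine.
restored = pickle.loads(blob)
print(restored.predict([0.2, 0.9]))  # -> [0, 1]
```

The same pattern applies to real estimators (e.g. a fitted Scikit-Learn model), with the caveat that unpickling requires the same class definitions and library versions to be available on the target system.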
CI/CD in Machine Learning
In traditional software systems, Continuous Integration and Delivery is the approach that provides automation, quality, and discipline for creating a reliable, predictable, and repeatable process to release software into production. Should the same be applied to ML systems? Yes, but the process is not simple: in ML systems, changes to the model and to the data used for training also need to be managed, along with the code, in the delivery process.
So unlike traditional DevOps, MLOps has 2 more steps every time CI/CD runs.
Continuous Integration in machine learning means that each time you update your code or data, the machine learning pipeline reruns, kicking off builds and test cases. If all tests pass, Continuous Deployment begins, deploying the changes to the environment.
MLOps also adds one more term, CT (Continuous Training), which comes into the picture when you need to automate the training process.
Although the market has some reliable MLOps tools, and new ones keep appearing, the space is still too young to reliably predict how an ML model will behave in production.
New tools like Gradient and MLflow are becoming popular for building robust CI/CD pipelines in ML systems. Tools such as Quilt and Pachyderm are leading the way for forward-looking data science and ML workflows, but they have not yet seen widespread adoption. Some other alternatives include Dat, DVC, and Git LFS; the space is still new and relatively unexplored.
Deployment Strategies
There are many different approaches to deploying machine learning models into production, and an entire book could be written on the topic; in fact, one probably already exists. The choice of deployment strategy depends entirely on the business requirement and on how we plan to consume the output predictions. At a very high level, it can be categorized as below:
Batch Prediction
Batch prediction is the simplest machine learning deployment strategy and is the one used in online competitions and academia. In this strategy, you schedule predictions to run at a particular time and write the output to a database or file system.
Implementation
The following approaches can be used to implement batch predictions:
- The simplest way is to write a Python program and schedule it with cron, though this requires extra effort to add validation, auditing, and monitoring. Nowadays, however, many tools and approaches make this task simpler.
- Writing a Spark batch job, scheduling it on YARN, and adding logging for monitoring and retries.
- Using tools like Prefect and Airflow, which provide UI capabilities for scheduling, monitoring, and alert notifications on failures.
- Platforms like Kubeflow, MLflow, and Amazon SageMaker also provide batch deployment and scheduling capabilities.
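A minimal cron-schedulable batch job might look like the following sketch; the file names, column names, and the toy model are hypothetical stand-ins for a real pipeline:

```python
import csv
import pickle

class MeanModel:
    """Hypothetical stand-in for a trained model: predicts the feature mean."""
    def predict(self, rows):
        return [sum(r) / len(r) for r in rows]

def run_batch(model_path, input_path, output_path):
    """One scheduled (e.g. cron-triggered) run: load the serialized model,
    score every record in the input file, and write the predictions out."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow(["id", "prediction"])
        for row in reader:
            features = [float(row["f1"]), float(row["f2"])]
            writer.writerow([row["id"], model.predict([features])[0]])

# Set up a toy model and an input batch, then run one "scheduled" pass.
with open("model.pkl", "wb") as f:
    pickle.dump(MeanModel(), f)
with open("batch_in.csv", "w", newline="") as f:
    f.write("id,f1,f2\n1,2.0,4.0\n2,10.0,0.0\n")
run_batch("model.pkl", "batch_in.csv", "batch_out.csv")
```

In a real deployment, the validation, auditing, and monitoring mentioned above would wrap this core loop, or an orchestrator like Airflow would run it as a task.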
Web Service
The most common and widely used machine learning deployment strategy is a simple web service, which is easy to build and deploy. The web service takes input parameters and outputs the model's predictions. Predictions are almost real-time and do not require many resources, since the service scores one record at a time, unlike batch prediction, which processes all records at once.
Implementation
- To implement predictions as a web service, the simplest way is to write a service and put it in a Docker container to integrate with existing products. This is not the sexiest solution, but it is probably the cheapest.
- The most common framework for implementing an ML model as a service is Flask. You can then deploy your Flask application on Heroku, Azure, AWS, Google Cloud, or simply on PythonAnywhere.
- Another common way is to build a Django app and deploy it on Heroku, AWS, Azure, or Google Cloud.
- There are a few newer options, like Falcon, Starlette, Sanic, FastAPI, and Tornado, also gaining ground in this area. FastAPI with the Uvicorn server is becoming popular because it requires minimal code and automatically generates both OpenAPI (Swagger) and ReDoc documentation.
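Flask or FastAPI would be the typical choice in practice; as a minimal, dependency-free sketch of the request/response shape such a service exposes, here is the same idea using only Python's standard library (the DummyModel and the port are hypothetical stand-ins):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class DummyModel:
    """Hypothetical stand-in: a real service would unpickle a trained model here."""
    def predict(self, features):
        return sum(features)

MODEL = DummyModel()  # loaded once at startup, shared across requests

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body like {"features": [1.0, 2.0]}.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": MODEL.predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging for the demo

# To serve for real: HTTPServer(("", 8000), PredictHandler).serve_forever()
```

A Flask or FastAPI version would replace the handler class with a route function, but the contract is the same: one request in, one prediction out.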
Why Online/Real-Time Predictions?
The above two approaches are widely used, and almost 90% of the time you will use one of them to build and deploy your ML pipelines. However, there are a few concerns with both:
1. Batch prediction requires careful performance tuning of the bulk size used for batch partitioning.
2. Service exhaustion, client starvation, failure handling, and retries are common issues with web services. If model calls are asynchronous, this approach fails to apply backpressure when there is a burst of data, such as during restarts. This can lead to out-of-memory failures in the model servers.
The answer to the above issues lies in the next two approaches.
Real-Time Streaming Analytics
In the last few years, the world of software has moved from Restful services to Streaming APIs, and so should the world of ML.
Hence another ML workflow that’s emerging nowadays is real-time streaming analytics, which is also known as Hot Path Analytics.
In this approach, requests or data arrive as a stream of events (commonly a Kafka stream), and the model is placed right in the firehose, running on the data as it enters the system. This yields a system that is asynchronous, fault-tolerant, replayable, and highly scalable.
Because the ML system in this approach is event-driven, it can also achieve better model-computing performance.
Implementation
- The most common way to implement an ML system with this strategy is to use Apache Spark or Apache Flink (both provide a Python API). Both allow easy integration of models written with Scikit-Learn or TensorFlow, in addition to Spark MLlib or Flink ML.
- If you are not comfortable with Python, or an existing data pipeline is written in Java or Scala, you can use the TensorFlow Java API or third-party libraries such as MLeap or JPMML.
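The pattern itself is framework-agnostic. A minimal sketch, with a plain Python iterable standing in for a Kafka topic and a lambda standing in for a loaded model, shows the shape of placing the model in the event stream:

```python
from typing import Callable, Iterable, Iterator

def score_stream(events: Iterable[dict],
                 model: Callable[[list], float]) -> Iterator[dict]:
    """Place the model 'in the firehose': enrich each event with a prediction
    as it flows through, instead of collecting a batch first."""
    for event in events:
        yield {**event, "prediction": model(event["features"])}

# Hypothetical stand-ins: in production the source would be a Kafka topic
# consumed by Spark/Flink, and the model a loaded Scikit-Learn artifact.
incoming = [{"id": 1, "features": [0.2, 0.4]},
            {"id": 2, "features": [0.9, 0.1]}]
model = lambda feats: round(sum(feats), 2)

for scored in score_stream(incoming, model):
    print(scored)
```

In Spark Structured Streaming or Flink, the generator above becomes a map operation over the stream, and the framework supplies the fault tolerance and replayability.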
Automated Machine Learning
If we just train a model once and never touch it again, we’re missing out on the information more/new data could provide us.
This is especially important in environments where behaviors change quickly, so you need an ML model that can learn from new examples in something closer to real-time.
With Automated ML, you should both predict and learn in real-time.
A lot of engineering goes into building an ML model that learns online, but the most important factor is the architecture and deployment of the model. Because the model can, and will, change every second, you cannot instantiate several instances; it is not horizontally scalable, and you are forced to have a single model instance that consumes new data as fast as it can, emitting sets of learned parameters behind an API. The most important part of the process (the model) is only vertically scalable, and it may not even be feasible to distribute it between threads.
Real-world examples of this strategy include Uber Eats delivery estimation, LinkedIn's connection suggestions, Airbnb's search engines, augmented reality, virtual reality, human-computer interfaces, and self-driving cars.
Implementation
- The Scikit-Learn library has a few algorithms that support online incremental learning through the partial_fit method, such as SGDClassifier, SGDRegressor, MultinomialNB, MiniBatchKMeans, and MiniBatchDictionaryLearning.
- Spark MLlib has little support for online learning, with only two algorithms: StreamingLinearRegressionWithSGD and StreamingKMeans.
- Creme also has good APIs for Online Learning.
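To illustrate the incremental-learning pattern without any dependencies, here is a hand-rolled sketch of an SGD regressor exposing the same partial_fit method signature that Scikit-Learn's online estimators use; the learning rate and toy data are arbitrary choices for the demo:

```python
class OnlineLinearRegressor:
    """Hand-rolled sketch of incremental learning: like partial_fit in
    Scikit-Learn, each call updates the weights from one mini-batch
    instead of retraining from scratch."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def partial_fit(self, X, y):
        # One SGD pass over the mini-batch, using the squared-error gradient.
        for xi, yi in zip(X, y):
            err = self.predict_one(xi) - yi
            self.b -= self.lr * err
            self.w = [w - self.lr * err * x for w, x in zip(self.w, xi)]
        return self

    def predict_one(self, x):
        return sum(w * xi for w, xi in zip(self.w, x)) + self.b

# Each mini-batch from the stream refines the same live model instance.
model = OnlineLinearRegressor(n_features=1)
for _ in range(500):  # simulated stream of mini-batches drawn from y = 2x
    model.partial_fit([[1.0], [2.0]], [2.0, 4.0])

print(round(model.predict_one([3.0]), 2))  # approaches 6.0 as it learns y = 2x
```

The key architectural point from the section above is visible here: there is exactly one mutable model instance, and every batch of new data changes it in place.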
Challenges
Online training also has issues associated with it. As the data changes often, the ML model can be sensitive to the new data and change its behavior, so on-the-fly monitoring is mandatory, and if the change exceeds a certain threshold, the data behavior has to be managed properly.
For example, in a recommendation engine, if one user likes or dislikes a category of items in bulk, this behavior, if not handled properly, can influence the results for other users. Chances are that such data is fraudulent, so it should be removed from the training data.
Handling these issues and patterns in batch training is relatively easy: misleading data patterns and outliers can simply be removed from the training data. In online learning it is much harder, and creating a monitoring pipeline for such data behavior can also hurt performance, given the size of the training data.
Other Variants in Deployment Strategies
There are a few other variants of deployment strategies, like ad-hoc predictions via SQL, model servers (RPCs), embedded model deployments, tiered storage without separate data storage, and using a database as model storage. All of these are combinations or variants of the four strategies above. Each strategy is a chapter in itself, so they are beyond the scope of this article. The essence is that deployment strategies can be combined and molded to the business need. For example, if data changes frequently but you do not have the platform or environment to do online learning, you can run batch learning (every hour or day, depending on need) in parallel with online prediction.
Monitoring ML Model Performance
Once a model is deployed and running successfully in a production environment, it is necessary to monitor how well the model is performing. Monitoring should be designed to provide early warnings of the myriad of things that can go wrong in a production environment.
Model Drift
Model drift describes the change in the predictive power of an ML model over time. In a dynamic data system where new data is acquired regularly, the data can change significantly over a short period. As a result, the data we used to train the model in the research environment no longer represents the data we actually receive in the live system.
Model Staleness
If we use historical data to train models, we must anticipate that the population, consumer behavior, the economy, and their effects may not be the same today. The features that were used to train the model will therefore change as well.
Negative Feedback Loops
One key property of live ML systems is that they tend to influence their own behavior as they update over time, which can lead to a form of analysis debt. This in turn makes it difficult to predict the behavior of an ML model before it is released. Such feedback loops are difficult to detect and address, especially if they occur gradually over time, as may be the case when models are not updated frequently.
To avoid or treat the above issues in a production system, there needs to be a process that measures the model's performance against new data. If the model falls below an acceptable performance threshold, a new process should be initiated to retrain the model with new or updated data, and the newly trained model should be deployed.
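A sketch of such a threshold-triggered check, where the live model, the threshold value, and the retraining function are all hypothetical stand-ins:

```python
def monitor(model, recent_X, recent_y, threshold=0.8, retrain=None):
    """Score the live model against freshly labelled data; if accuracy falls
    below the acceptable threshold, trigger retraining and return the new model."""
    preds = [model(x) for x in recent_X]
    accuracy = sum(p == y for p, y in zip(preds, recent_y)) / len(recent_y)
    if accuracy < threshold and retrain is not None:
        return retrain(recent_X, recent_y), accuracy  # deploy the fresh model
    return model, accuracy  # current model is still healthy

# Hypothetical drift scenario: the live model always predicts 0,
# but the freshly labelled data is now mostly 1s.
live_model = lambda x: 0
new_X = [[0.1], [0.2], [0.3], [0.4]]
new_y = [1, 1, 1, 0]

def retrain(X, y):
    majority = max(set(y), key=y.count)  # trivial stand-in for real training
    return lambda x: majority

model, acc = monitor(live_model, new_X, new_y, threshold=0.8, retrain=retrain)
print(acc)  # accuracy of the stale model on fresh data; below 0.8 here
```

In practice the metric, threshold, and retraining job would be tailored to the problem, and the check itself would run on a schedule or inside the serving pipeline.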
Conclusion
In the end, there is no generic strategy that fits every problem and every organization. Deciding which practices to use, and implementing them, is at the heart of what machine learning engineering is all about.
When starting an ML project, the primary focus usually goes to the data and the ML algorithms; but given how much work is involved in ML infrastructure and deployment, those factors deserve attention as well.
If you are looking for model deployment using batch and real-time, please refer-
Machine Learning Model deployment using Spark
Spark making ML deployment easier
Thanks for the read. I hope you liked the article!! As always, please reach out for any questions/comments/feedback.
Github: https://github.com/charumakhijani
LinkedIn: https://www.linkedin.com/in/charumakhijani/