Machine Learning in Production — How to Operate a Model Factory
How to leverage MLOps principles for more impactful machine learning initiatives
A couple of years ago, I touched on some of the key principles for handling the deployment of machine learning models. These days, merely deploying machine learning models into production is often not sufficient. What matters is operating them at scale, and this requires a very different process and way of working. The term MLOps is often used to identify this new paradigm. MLOps places primary consideration on automation and on managing models across their lifecycle.
Managing MLOps considerations requires specific infrastructure to be in place to help manage the different aspects of training and deploying models and tracking their performance. This stage of maturity in automating machine learning is often referred to as a Model Factory. Large internet companies such as Facebook or Uber have exploited such an approach with tooling such as FBLearner Flow or Michelangelo.
Understanding the Machine Learning lifecycle
There are multiple phases within a machine learning model lifecycle, from data collection to data cleansing, experimentation, debugging, or deprecation. Building a model factory means simplifying and automating each of these steps as much as possible to gain the fastest pace of iteration.
Data Collection
When looking at leveraging data to power an ML model factory, there are different things to consider. What initial data will be fed into model training? How will the various labels be populated? What metrics will be needed to monitor the model’s performance?
Approaches such as DMAIC from the Six Sigma methodology include a data collection plan as part of their empirical approach to process improvement and problem-solving. Most machine learning initiatives attempt to optimize processes and (should) embody similar process improvement methodologies.
For this purpose, it is essential to have some visibility on:
- What (raw) data might be needed to undertake this initiative
- What data is already collected
And understand the gaps between the two and what additional data collection effort might be needed. Robust metadata systems and data models help speed up the understanding of these data gaps.
There are different ways to remedy these data gaps: additional data logging, manual data collection, surveys, leveraging labeling/categorization systems such as Amazon Mechanical Turk, or acquiring data from 3rd parties.
It is worth noting that, as with DMAIC, data collection should be treated as a continuous process improvement initiative. Some of the key data collection questions that should be handled after the initial launch:
- How do we ensure a stream of new data comes in for both Features and Training/Testing sets?
- How do we ensure we can rely on user feedback to drive improvements and provide “more enriched” data?
- What additional datasets could be obtained to achieve better model performance?
Data Analysis
Analyzing the data is a necessary step of each machine learning iteration cycle, and it is essential to understand how the data behaves. This could be, for example, computing a feature (e.g., for a yearly contract, multiplying the installment amount by the number of installments, such as 4 for quarterly, to obtain the annual value) or understanding how specific labels have been assigned to particular subclasses of the population.
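As a minimal illustration of the kind of transformation described above (column names and values are purely hypothetical), annualizing a contract value could look like:

```python
import pandas as pd

# Hypothetical contract data; column names are illustrative only.
contracts = pd.DataFrame({
    "contract_id": [1, 2],
    "installment_amount": [250.0, 900.0],
    "billing_frequency": ["quarterly", "yearly"],
})

# Installments billed per year for each billing frequency.
installments_per_year = {"monthly": 12, "quarterly": 4, "yearly": 1}

# Annual contract value = installment amount x number of installments per year.
contracts["annual_value"] = (
    contracts["installment_amount"]
    * contracts["billing_frequency"].map(installments_per_year)
)
print(contracts)
```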
Data Analysis allows us to better inform the next steps in the process, be it feature engineering or model debugging. While some parts of the data analysis could be somewhat automated, a large part of the data analysis requires specific domain knowledge to make sense of the data and isn’t as easily automatable.
Some cloud tooling, such as SageMaker or AWS Glue DataBrew, helps facilitate this step and allows users who are less comfortable with programming to leverage these tools.
Cleansing & Data Transformation
Data cleansing and transformation are two critical aspects of ensuring you have the correct data to train your model on. An important point to consider at this step is that the data cleansing and feature engineering that are done need to be portable to the production workload. A large part of the cleansing and transformation process can also be automated to speed up the path to operationalization, leaving only exceptions to deal with.
Data Cleansing: Before leveraging the collected data further downstream, a data cleansing process needs to happen. Features and targets need to be cleansed to a certain degree before they can be leveraged downstream. Some of the typical cleansing operations involve removing observations (due to missing features or outliers) or imputing some of the data.
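A minimal sketch of these typical cleansing operations, assuming a generic pandas DataFrame with hypothetical column names:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleansing: drop rows without a target, trim outliers, impute features."""
    # Drop observations without a usable target value.
    df = df.dropna(subset=["target"]).copy()

    # Remove extreme outliers on a numeric feature (outside the 1st-99th percentile).
    low, high = df["order_amount"].quantile([0.01, 0.99])
    df = df[df["order_amount"].between(low, high)].copy()

    # Impute remaining missing feature values with the median.
    feature_cols = ["order_amount", "days_since_last_order"]
    df[feature_cols] = SimpleImputer(strategy="median").fit_transform(df[feature_cols])
    return df
```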
Feature Engineering: Feature engineering applies transformations on top of raw data and can help provide more “insightful” data for models to train on. Feature engineering work can be decomposed into two parts: 1) mathematical transformations and 2) embedded business knowledge.
Some AutoML libraries, such as TPOT or AutoSklearn, provide utilities to automatically create different mathematical transformations on top of raw data. Nevertheless, manual crafting of features is often still needed to help algorithms learn the right thing. Cloud solutions such as SageMaker Autopilot, SageMaker Data Wrangler, or Azure Automated Machine Learning cover some of these steps automatically.
An example of where this type of manual crafting can be necessary is when looking to train a model to predict when the next order will happen for customers of an eCommerce site. One issue that often occurs in eCommerce data is the inclusion of replacement orders in the general order feed and in features computed downstream. Crafting targets and features to specifically separate replacement orders generally allows for more accurate and/or impactful predictions.
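A sketch of what this kind of hand-crafted separation could look like; the column names and the rule used to identify replacement orders are hypothetical and will differ by data feed:

```python
import pandas as pd

def flag_replacement_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Flag replacement orders so downstream features/targets can exclude them.

    Assumes hypothetical `order_type` / `order_total` columns; real feeds will differ.
    """
    is_replacement = (orders["order_type"] == "replacement") | (orders["order_total"] == 0)
    return orders.assign(is_replacement=is_replacement)

# Example downstream feature: count of "real" orders per customer, replacements excluded.
# real_order_counts = (
#     flag_replacement_orders(orders)
#     .query("~is_replacement")
#     .groupby("customer_id")["order_id"]
#     .count()
# )
```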
Training:
When operating a model factory, the need for streamlining the training process increases. Being able to scale the number of offline experiments becomes increasingly essential, and with it, so is the need for tracking the results of these experiments.
AutoML plays a significant role in enabling this scaling, from supporting automated hyperparameter optimization to model & feature selection.
Tools such as MLRun, Weights & Biases (W&B), or SageMaker Experiments help keep track of the different offline training experiments and store their hyper-parameters and experiment results. They also allow tracking the different offline experiment metrics such as AUC/ROC, MSE, MAE…
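As a minimal sketch of what experiment tracking looks like in practice, here is an example using the open-source MLflow tracking API (one option among several; the experiment name, model, and hyper-parameters are purely illustrative):

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("purchase-propensity")  # illustrative experiment name
with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 6}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)
```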
Feature Store:
A feature store allows for more easily setting up an ML model factory. It provides a consistent interface for machine learning pipelines to source and re-use pre-calculated attributes for model training purposes.
Implementation
A feature store typically contains daily, weekly, or monthly calculated metrics for a given unit (e.g., a customer). A practical table structure for feature stores is that of a date-snapshotted partitioned table, i.e., a table containing the metrics as of a given day and partitioned by this date value. This keeps full traceability and versioning of the data and enables it to be treated as immutable (for a given partition), allowing for reproducible training of models.
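A minimal sketch of such a date-snapshotted, partitioned feature table, here with PySpark (the table and column names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

snapshot_date = "2021-06-01"  # illustrative run date of the daily batch job

# Compute per-customer features as of the snapshot date from a hypothetical raw table.
features = (
    spark.table("orders")
    .where(F.col("order_date") <= snapshot_date)
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("lifetime_orders"),
        F.sum("order_total").alias("lifetime_revenue"),
    )
    .withColumn("snapshot_date", F.lit(snapshot_date))
)

# Each daily run appends its own immutable snapshot_date partition.
(features.write
    .mode("append")
    .partitionBy("snapshot_date")
    .saveAsTable("feature_store.customer_daily"))
```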
Advantages
There are a few advantages to leveraging a feature store:
Simplification: Leveraging a feature store makes it easier to port models to production and keep them in sync. Specific tools can be put in place, such as Netflix’s Bulldozer, for syncing the data between batch and operational stores, allowing for a simplified operationalization of models.
Discoverability & Re-use: This approach makes feature collection more re-usable, decreases the cost of setting up subsequent models, and increases iteration speed. Part of the advantage of leveraging feature stores is that it helps codify some of the knowledge obtained during the data analysis phase, as hand-crafted features become broadly available.
Processing Performance: Computing features in batch requires significantly less processing power than calculating them for each individual at a hypothetical exposure time. Furthermore, the metrics/features can be pre-calculated once and re-used, further reducing the overall computation cost.
Disadvantages
One of the main disadvantages of relying on a feature store is data freshness and the challenges that come along with it in terms of signal, bias or noise, and the handling of new records.
Accuracy/Signal: Metrics/features should generally be calculated before exposure to treatment so that the effect does not pollute these variables — but you often want them to be calculated quite close to the exposure time to take into account the newest information available. Leveraging a feature store would not give the latest, most up-to-date information to power the prediction models, but rather the data as of the previous day, potentially providing less signal and accuracy.
Take the example of a feature based on add-to-cart events used to predict the likelihood of a customer making a purchase. As with most events, the signal contained within these add-to-cart events is likely to decay over time, and the likelihood of purchasing right after adding an item to the cart is expected to be drastically different than if the action was performed a day, a week, or a year ago.
‘Bias’/Noise: With a daily job, the freshness of information differs for each record. A user who purchased at 1 am after having visited at 11.45 pm will have data more in line with the action s/he intends to perform, while one who initially visited the site at 12.30 am and comes back at 11.45 pm the next day might have less intent to continue the same actions s/he was doing initially. This is particularly a concern for features/targets that require particularly fresh signals in order to be useful.
New records: Since this approach only relies on data as fresh as yesterday, it would not be able to properly handle newly arriving records, as the feature store would be lacking history for them. This situation is one of the reasons that typically leads to a distinction in how new and existing users/customers are addressed.
Data freshness can significantly impact the training and predictions of machine learning algorithms, and its impact will be highly dependent on the specific features and on what is being predicted. A feature such as place of birth would not be very impacted by data freshness, besides the handling of new records, and an additional add-to-cart event would likely not be as impactful for predicting gender as it would be for predicting a purchase for customers who already have a solid purchase history.
Considerations
There are multiple tradeoffs to take into account when looking to leverage a feature store: processing needs, complexity, latency, recency of information, accuracy of the data, etc. All of these serving considerations have implications on the data used for training, as the data used to train a machine learning algorithm should closely represent the type of data being fed to it for prediction purposes.
As noted above, metrics/features should usually be calculated before exposure to treatment so that the effect does not pollute these variables, but you often want them to be calculated quite close to the exposure time to take into account the new information provided.
On the other hand, features often need to be available in real time. Performing the full feature calculation at prediction time might not be wise; relying instead on pre-computed metrics or dynamic counters allows for fast response times.
The impact of leveraging this type of system will vary significantly by use case and by the importance of having freshly computed features.
A few frameworks provide feature store capabilities, such as the SageMaker Feature Store or the open-source Hopsworks and Feast frameworks. A high-level evaluation of Michelangelo (Uber’s feature store), Feast (Kubeflow’s feature store), Tecton, and SageMaker is provided by Pawel here.
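For illustration, retrieving features with the open-source Feast framework might look roughly like the following (the feature view and entity names are hypothetical, and the exact API can vary between Feast versions):

```python
from feast import FeatureStore

# Points at a Feast repository (feature definitions + online/offline store config).
store = FeatureStore(repo_path=".")

# Online retrieval at serving time for a single customer.
online_features = store.get_online_features(
    features=[
        "customer_daily:lifetime_orders",
        "customer_daily:lifetime_revenue",
    ],
    entity_rows=[{"customer_id": 1234}],
).to_dict()

# For training, get_historical_features() returns point-in-time correct feature
# values joined against a DataFrame of entities and event timestamps.
```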
Event Store:
An event store typically refers to one of the critical components of an event-driven/event sourcing architecture. The name is also leveraged in decisioning and real-time processing by vendors such as Pega or Tibco to refer to a long-term persistence layer for events.
In this more generic definition, the event store, like the feature store, simplifies deployment in production. It provides a layer for storing the raw information and helps provide all the necessary data to compute the different features (daily, weekly, etc.), exposure & treatment points, predicted model scores, etc.
Leveraging an event store can help power many use cases, such as calculating features at exposure time, debugging model predictions, or computing A/B test metrics.
An event store can provide more accurate features than a feature store; features can be computed as close as desired to the exposure or treatment point, potentially capturing increased signal close to the decision point. It also allows better handling of cases such as predictions for new users, for instance by leveraging freshly captured data. Tracking and debugging model predictions is also easier with this architecture, as it is possible to track the sequence of events, the predictions, and the actions made afterward. The same is true for tracking the performance of A/B experiments, as the metrics can be computed directly after exposure points.
Leveraging the data contained within an event store for machine learning purposes often requires an added degree of complexity. This is the case when a sequence of transformations and aggregations is needed to prepare the data for a prediction. These transformations are often implemented through real-time stateful enrichment approaches. One of the challenges with going this route rather than leveraging pre-computed features is the difficulty of keeping the transformation code used for training in sync with what is used in production.
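A minimal, in-memory illustration of computing a feature directly from raw events at the exposure point (a stand-in for querying an actual event store; the event fields are hypothetical):

```python
from datetime import datetime, timedelta
from typing import Dict, List

def add_to_cart_count_last_hour(events: List[Dict],
                                customer_id: int,
                                exposure_time: datetime) -> int:
    """Count a customer's add-to-cart events in the hour preceding the exposure point."""
    window_start = exposure_time - timedelta(hours=1)
    return sum(
        1
        for e in events
        if e["customer_id"] == customer_id
        and e["type"] == "add_to_cart"
        and window_start <= e["timestamp"] < exposure_time
    )

# Applying the very same function to historical events when building the training set
# is what keeps the training and serving transformations in sync.
```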
Hybrid Feature Computation approach
Mixed approaches are possible: a (serving) feature store can, for instance, include both real-time and daily computed features.
This comes at the cost of increased complexity, particularly around the management of real-time data. As in the event store case, the transformations used for training and production need to be kept in sync. However, this approach comes with the added complexity of having to treat some features as loaded daily and others as computed in real time.
There are benefits to leveraging this type of approach compared to a pure feature or event store. Not all types of data/features benefit from being computed in real time, and some might not warrant the cost of instrumenting them in real time. Take, for example, the use case of churn prediction; we know that the time since joining a service impacts related metrics such as activity rate. For this type of use case, it is pretty unlikely that having such a feature as time since joining computed in real time versus daily would have a massive impact on the predictive power of the models.
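A sketch of what assembling a hybrid feature vector could look like at serving time, assuming a batch feature store lookup and a separate real-time counter service (both represented here by plain dictionaries; names are hypothetical):

```python
def assemble_feature_vector(customer_id: int,
                            daily_features: dict,
                            realtime_counters: dict) -> dict:
    """Merge slow-moving daily features with fresh real-time counters for serving."""
    return {
        # Slow-moving attribute: a daily batch refresh is more than enough.
        "days_since_joined": daily_features.get((customer_id, "days_since_joined"), 0),
        # Fresh signal: worth the cost of real-time instrumentation.
        "add_to_cart_last_hour": realtime_counters.get((customer_id, "add_to_cart_last_hour"), 0),
    }

# Example call with in-memory stand-ins for the two stores.
vector = assemble_feature_vector(
    customer_id=1234,
    daily_features={(1234, "days_since_joined"): 412},
    realtime_counters={(1234, "add_to_cart_last_hour"): 3},
)
print(vector)
```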
Model selection
There are different aspects and tradeoffs to consider when deciding which models to leverage in a model factory.
There are different dimensions to consider: the interpretability of the model, its computational intensity, the model’s complexity and size, the minimum amount of data needed to be effective, …
For many use cases, conventional models such as generalized linear models (GLM) or random forests (RF) might be more appropriate than advanced deep learning techniques, with the available computation instead spent on training the model on more data or on further training iterations.
“Most machine learning methods tend to perform similarly, if tuned properly, when the covariates have an inherent meaning as predictor variables (e.g., age, gender, blood pressure) rather than raw measurement values (e.g., raw pixel values from images, raw time points from sound files)” — The Secrets of Machine Learning: Ten Things You Wish You Had Known Earlier to be More Effective at Data Analysis, Cynthia Rudin, David Carlson
Interpretable models
Leveraging interpretable models for production use cases has multiple benefits, from easier debugging to helping build trust both inside and outside the organization.
Debugging:
Debugging ML models can be pretty intensive. It requires going through the data used to train the model, the code used to train it, the data used for prediction, the code used for prediction, and the different sets of predictions created. From there, it is necessary to understand:
- Why did the model provide a specific prediction?
- Why was the model trained in such a way? Is it predicting out of sample? Is the model missing an interaction parameter? Was a feature binned too aggressively? Is it using a linear model/feature when the data shows an exponential relationship, etc.?
- What is the overall impact in terms of the occurrence of these predictions?
- What should be done to mitigate the impact?
All of these make debugging machine learning models quite a complex affair.
Some tools exist, such as Google’s What-If Tool, to help with debugging, but relying on a more straightforward, interpretable model can significantly speed up the debugging process.
Trust
Another reason to favor interpretable models is trust: more straightforward modeling approaches are easier to interpret and explain, and they more easily build confidence that the modeling approach is sound. Regulated industries might, for this reason, require fully interpretable models.
Another consideration that weighs in favor of interpretable models is fairness. Nowadays, some tools help with the interpretation of machine learning models, and the issue of fairness has seen its own tooling developed for it [1].
AutoML
AutoML helps perform some of the work needed with regard to model selection, data preparation (e.g., imputation, encoding (one-hot, etc.), feature hashing, bin counting), feature engineering (selection/extraction/meta-learning, etc.), or hyperparameter setting/optimization.
It provides a way to standardize parts of the training process, automate and increase the speed of iteration of machine learning experiments, make machine learning more accessible or increase the model’s accuracy.
Kai R. Larsen from the University of Colorado Boulder and Dan Becker from Google came up with eight criteria for AutoML excellence: accuracy, productivity, ease of use, understanding and learning, resource availability, process transparency, generalizability across contexts, and recommending action.
There is a wide range of support for AutoML, from libraries such as TPOT or AutoSklearn, to cloud-based solutions such as AWS SageMaker Autopilot or Azure Machine Learning, to capabilities embedded into solutions such as Salesforce Einstein. An evaluation of the different AutoML approaches and tools is available on arXiv.
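As a minimal sketch of AutoML in practice, here is what a run with the open-source TPOT library could look like (the dataset and the search budget are purely illustrative, and argument names may differ slightly across versions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Small search budget for illustration; real runs would use far more generations.
automl = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
automl.fit(X_train, y_train)

print(automl.score(X_test, y_test))
automl.export("best_pipeline.py")  # export the winning pipeline as plain scikit-learn code
```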
Leveraging AutoML, however, poses its own set of challenges and has broader implications, for example on talent strategy.
Infrastructure needs for ML in Production
An online feature store is often the first piece of infrastructure needed in production to be on the path to setting up a machine learning model factory.
Other components are needed to take advantage of it, such as:
- a robust CI/CD pipeline to facilitate the automatic deployment of these models into production
- A workflow automation tool
- A decision engine to allow for easier operationalization of the predictions
- An experimentation stack to be able to track the impact of the deployment of machine learning models into production
- A monitoring stack that allows for tracking model performance and detecting the number of outlier predictions
Operationalizing ML Models in Production
A few aspects are important to operationalize ML models appropriately, and three are key: Experimentation, Automation, and Monitoring.
Evaluation / Experimentation
There are three pillars of evaluation and experimentation for machine learning models:
- Offline testing is all about leveraging and tracking the different training experiments, using training and test sets to get preliminary insights into model performance.
- Backtesting is a particular type of offline testing, providing insights into model performance by testing it on historical data (see the sketch after this list).
- Online “Experiment Tracking”: We need to be able to understand the overall impact of the ML models, not just in terms of prediction accuracy metrics but in terms of overall business impact. Using an experimentation approach such as A/B or multivariate testing provides this level of visibility.
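A minimal sketch of the backtesting idea referenced above, using a simple time-based split (the column names, cutoff date, and model choice are all hypothetical):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def backtest(history: pd.DataFrame, cutoff: str,
             feature_cols: list, target_col: str) -> float:
    """Train on data before `cutoff`, evaluate on data after it, mimicking production timing."""
    train = history[history["snapshot_date"] < cutoff]
    test = history[history["snapshot_date"] >= cutoff]

    model = GradientBoostingClassifier().fit(train[feature_cols], train[target_col])
    preds = model.predict_proba(test[feature_cols])[:, 1]
    return roc_auc_score(test[target_col], preds)

# auc = backtest(history, cutoff="2021-06-01",
#                feature_cols=["lifetime_orders", "add_to_cart_last_hour"],
#                target_col="purchased_next_7_days")
```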
Workflow Automation
Different machine learning-specific workflow tools have emerged to help automate the various steps needed to productionize ML. To name just a few: MLFlow, Kubeflow, Dagster, or Netflix’s Metaflow.
These types of workflow automation engines differ from more traditional workflow engines such as Airflow or Argo Workflows in that their focus isn’t so much scheduling as supporting highly specialized pipelines that require, for instance, storing data artifacts, integrating with model serving platforms, or heavily parametrized workflows. Pawel Koperek wrote an evaluation of Kubeflow, providing in less than 500 words an overview of some of the features of such tooling.
The purpose of these tools is to simplify the model lifecycle workflow and increase the rate of experimentation.
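As a rough illustration of such tooling, a toy Metaflow flow might look like this (the steps and the “model” are placeholders; in a real flow, training data would come from the feature store and the model from an actual training routine):

```python
from metaflow import FlowSpec, step

class TrainingFlow(FlowSpec):
    """Toy training pipeline; attributes assigned to self are stored as versioned artifacts."""

    @step
    def start(self):
        # Placeholder data: (feature, label) pairs.
        self.training_data = [(0.2, 0), (0.8, 1), (0.5, 1)]
        self.next(self.train)

    @step
    def train(self):
        # Placeholder "model": mean feature value of the positive class.
        positives = [x for x, label in self.training_data if label == 1]
        self.model = sum(positives) / len(positives)
        self.next(self.end)

    @step
    def end(self):
        print(f"trained model artifact: {self.model}")

if __name__ == "__main__":
    TrainingFlow()  # run with: python training_flow.py run
```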
Monitoring & Alerting
Running machine learning models in production requires proper monitoring. Model performance can degrade over time, a phenomenon referred to as “prediction drift”; unexpected outliers can cause out-of-bound predictions; or a mix shift within the data can cause a more significant impact than expected.
Tools such as SageMaker Model Monitor or Azure ML offer features that make this operationalization easier.
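A minimal sketch of one way to detect feature drift, comparing production values against the training distribution with a two-sample Kolmogorov-Smirnov test (the threshold and sample data are illustrative; in practice this would feed an alerting system):

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_has_drifted(training_values: np.ndarray,
                        production_values: np.ndarray,
                        p_threshold: float = 0.01) -> bool:
    """Flag drift when the production distribution differs significantly from training."""
    _statistic, p_value = ks_2samp(training_values, production_values)
    return p_value < p_threshold

# Example: compare yesterday's scored feature values against the training distribution.
rng = np.random.default_rng(0)
print(feature_has_drifted(rng.normal(0, 1, 5_000), rng.normal(0.5, 1, 5_000)))
```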
Summary
Operating a model factory focuses on automating a number of phases of the machine learning training and deployment lifecycle. Andreessen Horowitz has provided an architectural blueprint showcasing what it takes to manage AI and machine learning use cases in a modern data architecture. Many components end up being involved, from feature stores to workflow automation tools, etc.