Monitoring Machine Learning Accuracy in the Enterprise

Adam Blum
@auger
5 min read · Jan 21, 2021

Machine learning tools and practices continue to develop at a dizzying pace. The industry has moved on from running Python tools on the data scientist’s local laptop to hosted services from all the major cloud vendors that inexpensively build sufficiently accurate models with many popular algorithms, both traditional ones (such as random forests, support vector machines and gradient boosting) and neural network-based deep learning algorithms.

The next phase in machine learning was the use of “automated machine learning” to try all likely ML algorithms and hyperparameter settings. This AutoML category was pioneered by the likes of DataRobot and academics at the University of Freiburg and elsewhere. Last year Google, Microsoft and Auger.AI entered the fray with much less expensive cloud-based tools in the $10-$20 per training hour range.

Finally, accurate model generation became accessible to the casual data scientist and the “citizen data scientist”. Given a spreadsheet, a target to predict and some features, they can execute a training run and get a hosted prediction endpoint within hours.

This very ease caused an explosion in the availability of predictive models, and the proliferation of models across the enterprise created a new problem. When a company had just a few models, it was possible to monitor their ongoing accuracy in real time. Given deep knowledge of the problem domain, an experienced data scientist could determine the level of accuracy that had to be maintained before a model was retrained or rebuilt from scratch. Remember, scoring metrics such as R² and F1 don’t tell you anything by themselves about a model’s suitability for use: a profitable regression model may well have an R² of 0.51, while a classification model with an F1 of 0.9 may not be acceptably accurate for a specific business use.
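As a minimal sketch of that point (using scikit-learn; the 0.85 cutoff below is a hypothetical, business-specific threshold, not a general rule), a scoring metric only becomes meaningful once it is compared against an acceptance bar that the business itself defines:

```python
# Sketch: a metric value alone says nothing about deployability; the
# acceptance threshold has to come from the business use case.
from sklearn.metrics import f1_score

def acceptable_for_deployment(y_true, y_pred, min_f1=0.85):
    """Return (score, ok), where ok reflects a business-chosen F1 floor."""
    score = f1_score(y_true, y_pred)
    return score, score >= min_f1

# The same F1 could pass one business's bar and fail another's.
score, ok = acceptable_for_deployment([1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1])
print(f"F1={score:.2f}, acceptable={ok}")
```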

And of course the scoring metric used to assess an acceptably trained model will almost always be higher than the accuracy experienced after the model is deployed. Even with k-fold cross-validation, training only fits the model to the data that is available; real-world data encountered after training will usually yield much lower accuracy. Translating model scoring metrics into business value and ROI has traditionally required deep data science and business domain expertise, which citizen data scientists generating quick models via AutoML and other cloud ML services just don’t have.
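For illustration, here is a small sketch (scikit-learn with synthetic data, purely an assumption-laden example) of how a cross-validated score can overstate the accuracy seen once the real-world relationship shifts after deployment:

```python
# Sketch: a k-fold cross-validation score is still an estimate on historical
# data; accuracy measured on data arriving after deployment is usually lower.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = RandomForestClassifier(random_state=0)
cv_accuracy = cross_val_score(model, X_train, y_train, cv=5).mean()

# Later, live data where the underlying relationship has drifted slightly.
X_live = rng.normal(size=(1000, 5))
y_live = (X_live[:, 0] + X_live[:, 1] > 0.5).astype(int)

model.fit(X_train, y_train)
live_accuracy = model.score(X_live, y_live)

print(f"cross-validated accuracy: {cv_accuracy:.2f}")   # optimistic estimate
print(f"post-deployment accuracy: {live_accuracy:.2f}")  # typically lower
```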

The data science world is just waking up to this reality. Recognition of the problems of data drift (significant changes in the distributions of the independent feature variables) and concept drift (fundamental changes in the dependent target variable and its relationship to the features) is now becoming widespread. Practitioners now know that all models decay, although the expertise to determine when a certain level of inaccuracy makes a predictive model counterproductive to use isn’t widely available.
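As an illustrative sketch (the two-sample Kolmogorov-Smirnov test and the 0.01 p-value threshold are assumptions, just one of several reasonable drift checks), feature drift can be flagged by comparing recent data against the training distribution:

```python
# Sketch of a simple data-drift check: compare each feature's recent
# distribution against its training distribution with a KS test.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(X_train, X_recent, p_threshold=0.01):
    """Return indices of features whose distribution shifted significantly."""
    flagged = []
    for j in range(X_train.shape[1]):
        stat, p_value = ks_2samp(X_train[:, j], X_recent[:, j])
        if p_value < p_threshold:
            flagged.append(j)
    return flagged

rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 3))
X_recent = rng.normal(size=(1000, 3))
X_recent[:, 0] += 1.5  # simulate drift in feature 0

print(drifted_features(X_train, X_recent))  # feature 0 should be flagged
```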

MLOps tools, often called “workbenches”, such as Amazon SageMaker and Dataiku claim to solve this problem, announcing that “we do everything in the lifecycle beyond just training the model”. A few of them even manage to provide a single “real-world experienced accuracy” number among the many things that they do. But none of them attempt to translate that accuracy into business results, value delivered and ROI. That is left as an exercise for the most experienced data scientists, who cannot possibly do it for the wealth of new models being created.

Machine Learning Review and Monitoring (MLRAM)

We at Auger.AI, as builders of the most accurate AutoML tool, witnessed this problem first-hand. Customers built models with Auger and then started executing predictions with them. Some never retrained those models. Others retrained them daily, with huge training bills as a result. We didn’t see any that tried to intelligently retrain only when the model became inaccurate. In discussions with them we weren’t able to tease out either the overall ROI of the predictive model or the business value of the incremental accuracy we delivered (often Auger replaced a less accurate model built with other tools).

We saw the need for a tool that reports granular real-world accuracy to the citizen data scientist in real time, which would allow us to deliver optimal retraining to the customer. We also saw the need for a configurable mechanism to determine the ROI of the predictive model itself and the business value of more accurate models. We could not find anything like this among the dozens of data science workbenches filling the booths at the shows we attended (ah, those pre-pandemic days!). Yes, you can get a single accuracy number, but nothing like true analytics of accuracy, and certainly nothing that computes ROI and business value.

So we created MLRAM. MLRAM lets you monitor the accuracy of your model, aggregated at several different levels of time granularity (think of it as OLAP for accuracy if you are a longtime big data practitioner). We also give you analytics on how your features and target are changing. If you see drastic changes and trends in your features and target, your model is likely to degrade quickly. Drastic changes in model accuracy during certain periods can also be clues that a model needs to be enhanced with new features (retraining may not be sufficient; you may need to build a new, expanded predictive model).
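To give a rough idea of what “OLAP for accuracy” means in practice, here is a sketch (pandas, with made-up column names and simulated data, not MLRAM itself) that rolls the same prediction/actual log up to daily and weekly accuracy:

```python
# Sketch: one prediction/actual log, aggregated at two time granularities.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
log = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=500, freq="h"),
    "predicted": rng.integers(0, 2, size=500),
})
# Simulated actuals that agree with the prediction about 80% of the time.
log["actual"] = np.where(rng.random(500) < 0.8, log["predicted"], 1 - log["predicted"])

log["correct"] = (log["predicted"] == log["actual"]).astype(float)
by_time = log.set_index("timestamp")["correct"]

print(by_time.resample("D").mean())  # accuracy per day
print(by_time.resample("W").mean())  # accuracy per week
```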

Finally, by asking the citizen data scientist a few basic questions about the real-world usage of the model (the filter criteria that turn predictions into “events”, the investment per event, and the financial return per event), we can compute the ROI of the predictive model. From this data we can even compute the business cost of the delivered inaccuracy, and we can thus determine the business value of being more accurate than a baseline model. All of this ROI and business value is computed with zero code written by an expert data scientist or developer. Such programs are traditionally quite complex for many reasons, including the fact that false positives and false negatives almost always have asymmetric costs.
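As a sketch of that ROI arithmetic (every number, the event filter and the cost model below are illustrative assumptions, not MLRAM’s actual formula), note how the asymmetric costs of false positives and false negatives fall out of the same three inputs:

```python
# Sketch: ROI and cost of inaccuracy from an event filter, an investment per
# event and a return per event. All figures are hypothetical.
def predictive_model_roi(records, threshold=0.5,
                         investment_per_event=10.0, return_per_event=100.0):
    """records: iterable of (predicted_probability, actual_outcome) pairs."""
    invested = returned = inaccuracy_cost = 0.0
    for prob, actual in records:
        acted = prob >= threshold  # filter criterion: which predictions become "events"
        if acted:
            invested += investment_per_event
            if actual:
                returned += return_per_event
            else:
                # False positive: investment spent with no return.
                inaccuracy_cost += investment_per_event
        elif actual:
            # False negative: foregone net profit (asymmetric to the FP cost).
            inaccuracy_cost += return_per_event - investment_per_event
    roi = (returned - invested) / invested if invested else 0.0
    return roi, inaccuracy_cost

roi, cost = predictive_model_roi([(0.9, 1), (0.7, 0), (0.2, 1), (0.8, 1)])
print(f"ROI: {roi:.0%}, cost of inaccuracy: ${cost:.2f}")
```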

What You Need To Do

All you need to do to get insights into real-world predictive model accuracy, and the ROI and business value those models deliver, is embed a single line of code into the apps that use your predictive models. This code simply tells MLRAM the predicted value of the target and the actual value. You can start with a low-volume plan of just 100 predictions/actuals per month; as you progress to higher prediction volumes, the price drops to pennies per prediction. You can also try it with just a spreadsheet of predictions and actuals that you upload. Contact sales@auger.ai for more details on how to get started with MLRAM.
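The exact call depends on the MLRAM client, so the following is only a hypothetical illustration of the pattern; the endpoint URL, field names and helper function are placeholders, not the real MLRAM interface:

```python
# Hypothetical sketch only: report each (prediction, actual) pair once the
# actual value is known. The URL and payload fields are placeholders.
import requests

def report_actual(prediction_id, predicted, actual,
                  endpoint="https://example.com/mlram/actuals",  # placeholder URL
                  api_key="YOUR_API_KEY"):
    requests.post(
        endpoint,
        json={"prediction_id": prediction_id, "predicted": predicted, "actual": actual},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=5,
    )

# e.g. after the real outcome of a prediction is observed:
# report_actual("order-123", predicted=0.92, actual=1)
```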
