Popmon Open Source Package — Population Shift Monitoring Made Easy

A disruptive open-source Python package developed by ING WB Advanced Analytics that enables the monitoring of ML model input data and predictions over time.

Nikoletta Bozika
inganalytics.com/inganalytics

--

ING WB Advanced Analytics (WBAA) is a tribe that builds 10x data-driven products. Within our tribe, the CoreAlgos-Research squad develops data science models and analytics tooling to be leveraged in our products while, at the same time, benefiting the wider data science community.

Max Baak, Data Science Lead, and Tomas Sostak, Data Scientist at WBAA, have worked on filling a significant gap in the monitoring, and thereby the improvement, of the performance of machine learning (ML) models running in production. Having identified the need, they developed popmon (short for “population shift monitor”), an open-source Python package that allows checking the stability of a dataset over time.

Max Baak — Data Science Lead

“During an internal ING review of one of WBAA’s algorithms, one finding was that we weren’t monitoring for population shift problems well enough. We found that there are no good open-source packages out there that allow us to monitor input data and predictions for such shifts in a straightforward, automated way. The monitoring provided in existing model management tools, if present at all, is too basic. When I worked at CERN I used to perform data quality monitoring myself. Therefore, I had some very well-developed ideas on how to approach this, using techniques from statistical process control. To monitor the stability of data and predictions over time we have developed popmon. When we had this ready for our use case in WBAA we decided to turn it into a generic package and open-source it”.

Why is the Monitoring of ML Model Performance Important?

Before deep-diving into popmon, it’s useful to explain why this tooling is so important. Organizations that want to develop an ML capability do so in order to gain useful insights, for example, to predict revenues, forecast clients’ due diligence, predict risk indicators, optimize customers’ networks, and more. This process includes some basic yet important steps, also known as the machine learning development cycle.

Firstly, you need to understand and define the business problem you are trying to solve with your ML model, which also determines the right data to use. From there, you need to collect and prepare the data to train your ML model. The quality of this data plays a major role in how good your model’s predictions will be. Then you need to train and test your models and compare the results, so as to end up with the best model, which will be deployed and run in your production environment.

This model will then be ready to predict outcomes based on a continuous input stream of data. However, the input stream of data in the production environment is far from stable. There are numerous anomalies that may affect the performance of your model and, in turn, the predictions it generates. For example, if the shape of the input dataset shifts, whether slowly or suddenly, the outcomes the model generates will shift with it. This effect is called population shift, and it is one of the reasons why monitoring a model’s performance over time is crucial, in order to maintain good predictive power and avoid costly results.

Tomas Sostak — Data Scientist

“popmon could be very useful to anyone doing serious monitoring of ML models running in production, as well as for simple exploratory data analysis. It is important to understand data patterns and how they change over time (e.g. seasonality, trends), observe anomalies (if any), and decide how to proceed with the data in order to achieve consistent and robust performance of your model. popmon has already helped us in numerous cases within ING, therefore we believe it can benefit the whole open-source community in taking (better monitored) data-driven decisions too.”

How does popmon work?
popmon checks the stability of a dataset over time. It does so by taking as input a DataFrame (either Pandas or Spark) where one of the columns should represent the date, and then produces a nice-looking report that indicates how stable all columns are over time.
For each column, the stability is determined by taking a reference (for example the data on which you have trained your classifier) and contrasting each time slot to this reference. This can be done in various ways:

  • Profiles: for example tracking the mean over time and contrasting this to the reference data. Similar analyses can be done with other summary statistics, such as median, min, max, or quartiles.
  • Comparisons: statistically comparing each time slot to the reference data (for example using Kolmogorov-Smirnov, chi-squared, or Pearson correlation).
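Based on popmon’s documented API, a minimal usage sketch might look as follows. The column names and data here are made up for illustration; the `pm_stability_report` accessor and its `time_axis`/`time_width` arguments follow the popmon documentation, but check the exact signature for your installed version:

```python
import pandas as pd

# Hypothetical input: daily feature data, with a "date" column
# that popmon will use as the time axis for binning.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "amount": [100 + i for i in range(90)],
})

try:
    import popmon  # noqa: F401 -- registers the pm_stability_report accessor
    # Bin the data into weekly time slots and build the stability report.
    report = df.pm_stability_report(time_axis="date", time_width="1w")
    report.to_file("stability_report.html")  # open in any browser
except ImportError:
    print("popmon is not installed; run `pip install popmon` first")
```

The same accessor is documented to work on Spark DataFrames after the popmon import.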

The reference can be defined in four different ways:

  1. Using the DataFrame on which you build the stability report as the reference, essentially allowing you to identify outlier time-slots within the provided data.
  2. Using a separate reference DataFrame (for example the data on which your classifier was trained, as in the above example), allowing you to identify which time slots deviate from this reference DataFrame.
  3. Using a sliding window, allowing you to compare each time slot to a window of preceding time slots (by default the 10 preceding time slots).
  4. Using an expanding reference, allowing you to compare each time slot to all preceding time slots.
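Per the popmon documentation, the four reference settings above map to values of a `reference_type` argument. The function below only illustrates the call patterns and is not executed here; the argument names (`reference_type`, `reference`, `window`) are taken from the docs for earlier popmon releases, so verify them against your installed version:

```python
def build_reports(df, train_df):
    """Illustrative only: one report per reference type (argument
    names per the popmon docs; check your installed version)."""
    # 1. Self-reference: detect outlier time slots within df itself.
    r_self = df.pm_stability_report(reference_type="self")
    # 2. External reference, e.g. the data your classifier was trained on.
    r_ext = df.pm_stability_report(reference_type="external", reference=train_df)
    # 3. Sliding window over the 10 preceding time slots.
    r_roll = df.pm_stability_report(reference_type="rolling", window=10)
    # 4. Expanding reference: all preceding time slots.
    r_exp = df.pm_stability_report(reference_type="expanding")
    return r_self, r_ext, r_roll, r_exp
```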

We use traffic lights to indicate where large deviations from the reference occurred. To see how this works, consider the following example. Suppose we have a value of interest: either a profile (a summary statistic tracked over time) or a z-score normalized comparison (a statistical test, e.g. Kolmogorov-Smirnov). To determine the difference compared to the reference, we also compute the value of interest on the reference data (top panel) and determine its mean and standard deviation across time units (center panel). We then determine the traffic lights as follows:

  • Green traffic light: indicates that there is no meaningful difference compared to the reference, i.e. the value of interest is less than four standard deviations away from the reference.
  • Yellow traffic light: indicates that there is a moderate difference compared to the reference, i.e. the value of interest is between four and seven standard deviations away from the reference.
  • Red traffic light: indicates that there is a large difference compared to the reference, i.e. the value of interest is more than seven standard deviations away from the reference.

The exact thresholds can be configured as a parameter. These traffic light bounds are then applied to the value of interest on the data from our initial DataFrame. If a traffic light is red, an automatic alert can be generated. For speed of processing, the data is converted into histograms prior to the comparisons. This greatly simplifies comparing large amounts of data with each other, which is especially beneficial for Spark DataFrames.
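The traffic-light logic described above can be sketched in a few lines of plain Python. This is only an illustration of the idea (green below four standard deviations, yellow between four and seven, red above seven), not popmon’s internal code; popmon computes these bounds from histogram-derived statistics and lets you configure them:

```python
import statistics

def traffic_light(value, reference_values, yellow=4.0, red=7.0):
    """Classify a value of interest by its 'pull': the number of
    reference standard deviations it lies from the reference mean."""
    mean = statistics.mean(reference_values)
    std = statistics.stdev(reference_values)
    pull = abs(value - mean) / std
    if pull >= red:
        return "red"      # large deviation: raise an alert
    if pull >= yellow:
        return "yellow"   # moderate deviation
    return "green"        # no meaningful deviation

# e.g. weekly means of a feature over the reference period
reference = [10.1, 9.9, 10.0, 10.2, 9.8]   # mean 10.0, std ~0.16
print(traffic_light(10.3, reference))  # green  (~1.9 std away)
print(traffic_light(11.0, reference))  # yellow (~6.3 std away)
print(traffic_light(12.0, reference))  # red    (~12.6 std away)
```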

Summary of Reporting with popmon

The output of popmon is essentially an HTML report, which can be viewed directly in Jupyter notebooks, as well as easily exported to a file and opened in any browser.

At this point, the report has a settings section, where a user can configure what to see in the report, and five further sections, each containing a specific type of data. The sections appear in the following order: profiles, comparisons, traffic lights, alerts, histograms.

There are many different settings that can be provided as arguments to the report-generation function (e.g. reference type, short vs. extended report, time binning, monitoring rules); these can all be found in our extensive documentation.
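As an illustration, a report call with several of these settings spelled out might look like the sketch below. The function is not executed here, and the argument names (`extended_report`, `monitoring_rules`, and the `[7, 4, -4, -7]` pull bounds, which match the four/seven standard-deviation defaults described earlier) are taken from the popmon documentation; treat them as assumptions and check your installed version:

```python
def detailed_report(df):
    """Illustrative only: a report call with explicit settings
    (argument names per the popmon docs)."""
    return df.pm_stability_report(
        time_axis="date",        # column used for time binning
        time_width="1w",         # weekly time slots
        reference_type="self",   # use the DataFrame itself as reference
        extended_report=False,   # short report instead of the full one
        monitoring_rules={"*_pull": [7, 4, -4, -7]},  # traffic light bounds
    )
```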

popmon enables you to store the histograms together with the report (since the histograms are just a fraction of the size of the original data), making it easy to go back to a previous report and investigate what happened. It is also possible to zoom in on the histogram data and see how the distributions changed throughout the latest weeks (in the last section, histograms).

The profiles and comparisons sections include plots of basic statistics and of statistical comparisons to the reference set, respectively.

The traffic lights section outlines which specific statistics deviate from the reference too much (resulting in a yellow or red traffic light), whereas the alerts section aggregates all the traffic lights at the feature level (all statistics combined for that feature). It is important to note that all the aforementioned plots have time units on the x-axis and are filtered by data feature.

Developing such a demanding package comes with challenges, which Max and Tomas addressed successfully despite the extra time and effort needed. The main one, they both highlighted, was the shift into an engineering-heavy project, which nevertheless brought great knowledge and made the code a lot better.

“Developing this package was a real challenge, since by nature we are data scientists, not software engineers. Nevertheless, data science knowledge was essential when implementing the statistical tests and checking whether the outputs of the functions made sense, as well as when integrating the modules as a whole” — Tomas

Machine learning is powerful for solving business problems, surfacing great ideas, and making powerful predictions. Nevertheless, the constant monitoring of an ML model is essential to obtain the best results, and this is a very common challenge in the field. Humans need to be kept in the loop and set up the right tooling to ensure that an ML model keeps generating accurate predictions. This is what popmon does, and it is expected to offer significant help to the data science and data engineering community.

“The performance monitoring of ML models in existing popular model management tools, such as mlflow, generally focuses on metrics from the model retraining cycle: quantities like precision, recall or AUC. What they don’t do is monitor the stability of the distributions of the input data or of the outgoing predictions. And you need all three to form the complete performance picture. So I really think popmon is filling a gap in the market and is interesting to anyone doing serious monitoring of ML models running in production” — Max

You can find the popmon Python package, including all documentation, open-sourced on GitHub via the link below:

--

--