POPMON v1.0.0: The Dataset-Shift Pokémon

A complete, open-source, lightweight Python library for monitoring the stability of data over time. Now even better: release v1.0.0 brings exciting new features and huge performance improvements.

Pradyot Patil
inganalytics.com/inganalytics
4 min read · Aug 3, 2022


POPMON Recap

popmon (population shift monitor) is an open-source Python library developed by ING Global Analytics (INGA). It checks the stability of a tabular dataset over time; in other words, it checks for “dataset shift”. This is particularly useful for monitoring Machine Learning (ML) models in production, whose performance may be affected by changing input data.
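To make the idea of a dataset shift concrete, here is a minimal sketch that compares the histogram of each time slot to a reference histogram. It uses the total variation distance and a fixed threshold purely for illustration; the distance measure, the threshold value, and the toy data are assumptions, and popmon's own comparisons and alerting rules are richer than this.

```python
def normalized(hist):
    """Turn a {category: count} histogram into {category: fraction}."""
    total = sum(hist.values())
    if total == 0:
        return {}
    return {key: count / total for key, count in hist.items()}


def total_variation(reference, current):
    """Total variation distance between two histograms (0 = identical, 1 = disjoint)."""
    p, q = normalized(reference), normalized(current)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical reference data and two weekly time slots of one categorical feature.
reference = {"A": 50, "B": 30, "C": 20}
weekly = {
    "week 1": {"A": 48, "B": 32, "C": 20},  # close to the reference
    "week 2": {"A": 20, "B": 30, "C": 50},  # clearly shifted
}

THRESHOLD = 0.1  # hypothetical alert threshold, chosen for this example only

# Collect the time slots whose distribution moved too far from the reference.
shifted = [
    slot for slot, hist in weekly.items()
    if total_variation(reference, hist) > THRESHOLD
]
```

In this toy setup only the second week ends up in `shifted`, which is the kind of time slot a stability report would want to draw attention to.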

With just one line of code, popmon produces a comprehensive HTML report showing, over time, alert overviews, histogram plots, heatmap plots, and statistical profiles for each feature in the dataset. A development version was first released in April 2020. After steady progress and contributions, INGA has now released popmon v1.0.0 with a bunch of exciting new features and huge performance improvements.
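As a rough sketch of that one-line usage, assuming pandas and popmon are installed and the data has a timestamp column, a report could be generated along these lines. The helper name `build_stability_report`, the column name "date", and the file names are hypothetical; the accepted keyword arguments can differ between popmon versions, so check the documentation of the release you have installed.

```python
def build_stability_report(csv_path, time_column="date", out_path="report.html"):
    """Generate a popmon stability report for a CSV file (illustrative helper)."""
    # Imported lazily so that defining this helper does not itself require
    # pandas or popmon to be installed.
    import pandas as pd
    import popmon

    # Assumption: the CSV has a timestamp column named by `time_column`.
    df = pd.read_csv(csv_path, parse_dates=[time_column])

    # The actual "one line": build the report, binning the data along the time axis.
    report = popmon.df_stability_report(df, time_axis=time_column)

    # Write the HTML report to disk and hand the report object back to the caller.
    report.to_file(out_path)
    return report
```

Calling `build_stability_report("data.csv")` on a suitable file would then write `report.html` next to your script.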

This article focuses on the new features and performance improvements in the popmon v1.0.0 release. More about popmon and its initial release can be read in the earlier Medium article.

Max Baak (Chapter Lead Data Science, INGA), Simon Brugman (Data Scientist, INGA), Tomas Sostak (Data Scientist, Vinted), and Pradyot Patil (IT trainee, INGA) have been continuously working on the development of popmon since its initial release to make it a stable and complete tool.

Check out POPMON at SciPy 2022 🎉

popmon will be featured at the SciPy 2022 conference in Austin, Texas. Look out for the poster at the conference; the related popmon paper can be found here. The annual SciPy Conference brings together attendees from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.

New Features

popmon v1.0.0 has been released with a long list of new features. In this article, we highlight some of the most exciting ones.

Alerts per feature: wouldn’t it be great if the report quickly directed your attention to features with possible flaws and outliers in your data? The overview of alerts per feature does exactly that.

Heatmaps: these help viewers visualize the data that matters, especially when tracking changes in data over time. In the new version, three heatmaps are plotted for each categorical feature: a normal heatmap, a column-normalized heatmap, and a row-normalized heatmap.

Heatmap in popmon v1.0.0
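To illustrate the difference between the three variants, here is a small sketch of how a table of counts could be normalized per column and per row. It assumes that rows are categories and columns are time slots, and the toy counts are made up; the exact definitions used inside popmon may differ.

```python
def normalize_columns(counts):
    """Scale each column (time slot) so that its entries sum to 1."""
    n_rows, n_cols = len(counts), len(counts[0])
    col_sums = [sum(counts[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [counts[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


def normalize_rows(counts):
    """Scale each row (category) so that its entries sum to 1."""
    normalized = []
    for row in counts:
        row_sum = sum(row)
        normalized.append([v / row_sum if row_sum else 0.0 for v in row])
    return normalized


# Hypothetical counts: rows are categories, columns are two consecutive time slots.
counts = [
    [30, 60],  # category "A"
    [10, 20],  # category "B"
    [60, 20],  # category "C"
]

col_norm = normalize_columns(counts)  # share of each category within a time slot
row_norm = normalize_rows(counts)     # how each category is spread over time
```

Under these assumptions, the column-normalized view makes time slots comparable even when their total counts differ, while the row-normalized view shows when a given category was most active.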

Interactive Plots: one of the most exciting features included in v1.0.0. The matplotlib plots of previous versions have now been completely replaced by Plotly plots. These plots are high quality, lightweight, and let viewers interact with a plot using simple plot tools.

Profile & Comparisons extension mechanism: A huge part of the popmon report consists of statistical comparisons of each time slot to the reference dataset and statistical profiles of each feature. Users can now easily extend these by adding their own statistical tests through the newly available extension mechanism. Moreover, the new release includes entropy as a profile for the features.
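The sketch below shows the general idea behind such an extension mechanism, using entropy as the example profile. The `PROFILES` registry and the `register_profile` decorator are hypothetical and are not popmon's API; the use of the natural logarithm is also an assumption, since the article does not specify the definition popmon uses.

```python
import math

# Hypothetical registry, for illustration only; this is not popmon's API.
PROFILES = {}


def register_profile(key):
    """Decorator that adds a profile function to the registry under `key`."""
    def decorator(func):
        PROFILES[key] = func
        return func
    return decorator


@register_profile("entropy")
def shannon_entropy(counts):
    """Shannon entropy (natural log) of a histogram given as bin counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p) for p in probs)


@register_profile("n_filled_bins")
def filled_bins(counts):
    """Number of bins that contain at least one entry."""
    return sum(1 for c in counts if c > 0)


def profile_histogram(counts):
    """Apply every registered profile to the histogram of one time slot."""
    return {key: func(counts) for key, func in PROFILES.items()}


uniform = profile_histogram([25, 25, 25, 25])  # evenly spread: high entropy
peaked = profile_histogram([100, 0, 0, 0])     # single filled bin: zero entropy
```

With a registry like this, adding a new profile only requires writing one decorated function; the code that loops over time slots does not need to change.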

Even more pre-defined reference types: popmon now provides a wider range of reference configurations that are helpful in common user flows.

Dashboarding integration with Kibana: popmon can be part of any data workflow, be it for data monitoring or exploratory data analysis. The repository now provides resources to kickstart integrating popmon with Kibana.

Performance Updates

Faster report generation & reduced report size: replacing matplotlib with Plotly lets popmon generate reports much faster and keeps the report size low. The new plots are simply stored as JSON strings, which plotly.js takes as input to draw the graphs when the report is rendered. This results in an approximately 90% drop in report generation time on average. The new version also reuses the JSON layout data across plots, which reduces the report size by approximately 80% compared to previous versions.
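The effect of reusing layout data can be illustrated with a small sketch. The payload structure, the layout fields, and the toy traces below are assumptions made for this example and do not reflect popmon's actual report format; the point is only that storing a shared layout once, instead of once per plot, shrinks the serialized output.

```python
import json

# Hypothetical layout shared by all plots in one section of a report.
shared_layout = {
    "height": 400,
    "margin": {"l": 40, "r": 10, "t": 30, "b": 40},
    "xaxis": {"title": "time"},
    "yaxis": {"title": "value"},
    "showlegend": False,
}

# Hypothetical per-plot data: one bar trace for each of 20 features.
traces = {
    f"feature_{i}": {"type": "bar", "x": list(range(10)), "y": [i * k for k in range(10)]}
    for i in range(20)
}

# Naive approach: every plot carries its own full copy of the layout.
naive_payload = json.dumps(
    {name: {"data": [trace], "layout": shared_layout} for name, trace in traces.items()}
)

# Deduplicated approach: the layout is stored once, plots store only their data.
compact_payload = json.dumps(
    {"layout": shared_layout, "plots": {name: [trace] for name, trace in traces.items()}}
)


def expand(payload):
    """Rebuild full figure specifications from the compact payload."""
    doc = json.loads(payload)
    return {
        name: {"data": data, "layout": doc["layout"]}
        for name, data in doc["plots"].items()
    }


# Fraction of bytes saved by storing the layout only once.
saving = 1 - len(compact_payload) / len(naive_payload)
```

The `expand` step stands in for what would happen at render time: each plot's data is recombined with the shared layout before being handed to the plotting library, so no information is lost.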

Lazy loading of plots: the new version incorporates lazy loading, which renders a plot only when it is scrolled into view. This enables smooth rendering of plots and puts less load on the client browser.

Histogrammar v1.0.30: popmon uses histogrammar for creating histograms. Histogrammar is a Python implementation for creating histograms with NumPy, pandas, and Spark. The recent performance updates in v1.0.30, which bin the data for histograms more cleverly, also mean a performance upgrade for popmon.

Outlook

With the release of v1.0.0, popmon is now a complete product ready to be used in various data workflows. It is especially relevant for Machine Learning workflows, where ML models are trained on historical data and then deployed in a production environment. The performance of such a model might be affected by an unstable input data stream: for example, if a data shift happens, its performance in production may differ significantly from its performance on the training data.

Can you think of a use case or workflow at your organization where popmon can be useful? Go ahead and try it out, and if possible, let us know how it can be improved. Contributions to the project are also welcome; here is a good starting point!
