Why do you need to babysit ML models after deployment?

🤖 NannyML for post-deployment model monitoring [part 1]

Ansh Tanwar
4 min read · May 11, 2024


Ever wonder why keeping an eye on your machine learning models after deployment is crucial? Let's explore the concept of post-deployment model monitoring through a story about weekly sales forecasting in a retail store.

Following this story, we'll get hands-on with a real Walmart sales dataset. We'll build a sales forecasting model and monitor its performance in production. We'll mimic the production environment directly within our Google Colab notebook, making it easy to follow along.

We will find out how machine learning models degrade over time due to real-world changes, and why a lack of monitoring can lead to significant financial losses. Then we will see how to use NannyML for performance monitoring of ML models, and why the probabilistic performance-estimation algorithms that NannyML pioneered are needed.

Mr. Danny’s Retail Store

(Image generated with Copilot Designer)

Imagine Mr. Danny opens a large retail store, similar to Walmart. As the store expands, he wants to forecast his weekly sales so he can plan stock and demand in advance and prepare for upcoming trends. To do this, he hires a data scientist to build a machine learning model that predicts weekly sales.

The data scientist developed an impressive model to forecast Mr. Danny’s weekly sales, achieving 97% accuracy on the test data. After deploying this model, the data scientist bid farewell. :*)

(Image: data scientists should also take charge of ML monitoring)

At first, the model served Mr. Danny well in production. After some time, however, it began to fail and gave incorrect predictions, and he incurred a financial loss. Can you guess why the model's performance degraded⁉️

Mr. Danny, who had limited technical knowledge, hired an ML Engineer, who identified the cause of the failure as data drift. Simply put, data drift happens when real-world data changes in ways the model wasn't trained for, leading to a decline in model performance.

The ML Engineer provided two possible reasons for the decline in model performance:

  • Univariate drift: the distribution of a single input feature changes in production. For example, the temperature pattern suddenly shifted, which altered shopping behavior and increased sales of sunscreen and soft drinks.
  • Multivariate drift: the joint relationship between features changes, even if each feature looks stable on its own. Changes in the employment rate, recession, and temperature collectively influenced the sales pattern and consumer behavior (both checks are sketched in code below).

To understand them more deeply, visit: How to Detect Data Drift with Hypothesis Testing (nannyml.com)
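Here is a minimal sketch of what both checks look like in NannyML. The DataFrames, file names, and feature columns below are assumptions for illustration, not the actual code from this series (we build the real versions in part 2):

```python
import pandas as pd
import nannyml as nml

# Assumed setup: two DataFrames with the same columns.
# 'reference' = a period of known-good model behavior (e.g. the test set),
# 'analysis' = the production period we want to check for drift.
reference_df = pd.read_csv('reference.csv')  # hypothetical file names
analysis_df = pd.read_csv('analysis.csv')

# Hypothetical feature names from a Walmart-style sales dataset.
features = ['temperature', 'fuel_price', 'cpi', 'unemployment']

# Univariate drift: test each feature's distribution on its own,
# e.g. with a Kolmogorov-Smirnov test per chunk of production data.
univariate = nml.UnivariateDriftCalculator(
    column_names=features,
    timestamp_column_name='date',
    continuous_methods=['kolmogorov_smirnov'],
)
univariate.fit(reference_df)
univariate_results = univariate.calculate(analysis_df)

# Multivariate drift: compress all features with PCA and track the
# reconstruction error, which rises when the relationships *between*
# features change even if each feature looks stable in isolation.
multivariate = nml.DataReconstructionDriftCalculator(
    column_names=features,
    timestamp_column_name='date',
)
multivariate.fit(reference_df)
multivariate_results = multivariate.calculate(analysis_df)

univariate_results.plot().show()
multivariate_results.plot().show()
```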

The ML Engineer advocated for placing data drift at the center of the monitoring solution. Mr. Danny accepted the idea and began monitoring his model for drift. He then received numerous alerts on a daily basis, which caused him significant mental stress.

When Mr. Danny received the ground truth (the actual ‘weekly_sales’ of his store for that week), he discovered that most of the alerts were false: more than 90% of the alarms were false positives, and only 10% correctly indicated a decline in model performance. As a result, making data drift the central monitoring strategy proved unsuccessful.

Note: Data drift detection is crucial, but drift alone can paint a misleading picture of an ML model's true performance in production. It is better used in later stages of the monitoring process, such as root cause analysis, where it serves as a tool to identify and explain the factors impacting model performance.

Finally, Mr. Danny discovered a library called NannyML. This tool can act as a babysitter for his ML model, continuously monitoring its performance without drowning him in false alarms. It allowed him to estimate his model's performance (without access to the ground truth) and to detect data drift.

NannyML/nannyml: nannyml: post-deployment data science in python (github.com)
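To make "estimating performance without ground truth" concrete, here is a hedged sketch using NannyML's DLE (Direct Loss Estimation) algorithm, which targets regression problems like sales forecasting. Again, the file names and column names are assumptions for illustration:

```python
import pandas as pd
import nannyml as nml

# Assumed inputs: 'reference' contains features, predictions AND actuals;
# 'analysis' contains features and predictions, but the actuals have not
# arrived yet -- exactly Mr. Danny's situation.
reference_df = pd.read_csv('reference.csv')  # hypothetical file names
analysis_df = pd.read_csv('analysis.csv')

# DLE fits an internal "nanny" model on the reference data to predict
# the monitored model's error, then uses it to estimate MAE/MAPE on
# production data before the true weekly sales are known.
estimator = nml.DLE(
    feature_column_names=['temperature', 'fuel_price', 'cpi', 'unemployment'],
    y_pred='predicted_weekly_sales',  # hypothetical column names
    y_true='weekly_sales',
    timestamp_column_name='date',
    metrics=['mae', 'mape'],
    chunk_period='W',                 # estimate performance week by week
)
estimator.fit(reference_df)
estimated_performance = estimator.estimate(analysis_df)
estimated_performance.plot().show()
```

The key design idea: instead of waiting weeks for actual sales to arrive, the estimator raises an alert only when the *estimated* performance drops, which is what lets it avoid the flood of false drift alarms from the story above.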

Next, we'll use a popular Walmart sales dataset available on Kaggle. Let's assume Mr. Danny has similar data for his store. We'll explore how to apply NannyML for model monitoring and performance estimation.

Walmart Dataset (kaggle.com)

This historical dataset covers weekly sales from 2010-02-05 to 2012-11-01 for multiple Walmart stores across the United States.
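If you want to peek ahead, loading the dataset is a one-liner with pandas. The file name below is an assumption; check what your Kaggle download is actually called:

```python
import pandas as pd

# 'Walmart.csv' is an assumed file name for the Kaggle download.
# Dates in this dataset appear in day-first format; adjust if yours differ.
df = pd.read_csv('Walmart.csv', parse_dates=['Date'], dayfirst=True)

print(df.shape)
print(df.head())
```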

In the next part of this article, we will:

  1. Conduct retail demand forecasting to predict weekly sales using this data.
  2. Apply NannyML to this data to demonstrate post-deployment model monitoring.
  3. Investigate why most alarms triggered in Mr. Danny’s case were false.

👉 Visit Part 2: Why you need to babysit ML models after deployment? [part 2]
