Why do you need to babysit ML models after deployment? [Part 2]

🤖nannyML for post-deployment model monitoring

Ansh Tanwar
8 min read · May 12, 2024


Visit Part 1 to understand why constant monitoring of ML models is essential. Part 1 includes an intriguing story about Mr. Danny’s retail store.

In Part 2, we will do some interesting stuff:

  1. Conduct retail demand forecasting to predict weekly sales using this data.
  2. Apply nannyML tools to this data for post-deployment model monitoring.
  3. Investigate why most of the alarms triggered in Mr. Danny’s case were false.

Introduction to the weekly sales data

We’ll be utilizing a Walmart sales dataset, a widely used dataset accessible on Kaggle. It encompasses historical sales data from multiple Walmart stores situated across the United States. Our objective is to conduct retail demand forecasting and predict the weekly sales.

This is the historical data that covers sales from 2010–02–05 to 2012–11–01:

Feature Description

  • Store - the store number
  • Date - the week of sales
  • Weekly_Sales - sales for the given store
  • Holiday_Flag - whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)
  • Temperature - Temperature on the day of sale
  • Fuel_Price - Cost of fuel in the region
  • CPI – Prevailing consumer price index
  • Unemployment - Prevailing unemployment rate
  • Month, Year, and Season - time-related features
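The loading step itself isn’t shown in the article, so here is a minimal sketch of reading the Kaggle CSV and deriving the time-related features. The file name Walmart.csv and the month-to-season mapping are assumptions.

import pandas as pd

# Load the Kaggle Walmart dataset (file name is an assumption)
df = pd.read_csv('Walmart.csv')
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # Kaggle stores dates as DD-MM-YYYY

# Derive the time-related features
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year

# Map months to seasons (assumed mapping)
season_map = {12: 'Winter', 1: 'Winter', 2: 'Winter',
              3: 'Spring', 4: 'Spring', 5: 'Spring',
              6: 'Summer', 7: 'Summer', 8: 'Summer',
              9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}
df['Season'] = df['Month'].map(season_map)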

EDA and Preprocessing

Let’s perform some basic Exploratory Data Analysis and analyze the distribution of some features.
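The distribution plots and correlation heatmap discussed below aren’t reproduced here; a minimal sketch to recreate them, assuming df is the DataFrame loaded above and matplotlib/seaborn are available, looks like this:

import matplotlib.pyplot as plt
import seaborn as sns

# Distributions of the numerical features
num_features = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Weekly_Sales']
df[num_features].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# Correlation heatmap of the numerical features
sns.heatmap(df[num_features].corr(), annot=True, cmap='coolwarm')
plt.show()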

  • CPI and Fuel_Price have bimodal distributions, while Temperature and Unemployment are roughly normally distributed.
  • The Weekly_Sales distribution is right-skewed, as it contains some outliers.
  • Weekly_Sales are higher in winter and around holidays, especially in November and December.
  • The correlation heatmap of all the features shows interesting relationships between the input features.

Preprocessing and outlier removal

# Remove outliers using the IQR method
num_features = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Weekly_Sales']
for feature in num_features:
    q1 = df[feature].quantile(0.25)
    q3 = df[feature].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df = df[(df[feature] >= lower) & (df[feature] <= upper)]

from sklearn.preprocessing import StandardScaler
from category_encoders import BinaryEncoder

# scaling numerical variables
sc = StandardScaler()
df[num_features] = sc.fit_transform(df[num_features])

# encoding categorical features
categoric_columns = ['Store', 'Season']
df[categoric_columns] = df[categoric_columns].astype('category')
encoder = BinaryEncoder(cols=categoric_columns)
df = encoder.fit_transform(df)

Divide and prepare data for nannyML

Typically, we split our data into training, validation, and test sets. For model monitoring, however, we need one more split that mimics production data, so the data is divided into four parts. This extra set ensures our monitoring system correctly detects performance drops using the right algorithms and reports what went wrong.

In a model monitoring context, the test set is also called the reference set. NannyML uses the model’s performance on this set as a baseline for production performance.

The analysis set contains the production data together with the model’s predictions. Ground truth is not available here (in our case, the ground truth is the future weeks’ sales of the retail store).

# Note - we are not creating a validation set here.
# Refer to the image above to get a better understanding of this code.

# Create data partition
df['partition'] = pd.cut(
    df['Date'],
    bins=[pd.to_datetime('2010-02-12'),
          pd.to_datetime('2012-02-12'),
          pd.to_datetime('2012-06-12'),
          pd.to_datetime('2012-10-26')],
    right=False,
    labels=['train', 'test', 'prod']
)

# Set target and features
target = 'Weekly_Sales'
features = [col for col in df.columns if col not in [target, 'Date', 'partition']]

# Split the data
X_train = df.loc[df['partition'] == 'train', features]
y_train = df.loc[df['partition'] == 'train', target]
X_test = df.loc[df['partition'] == 'test', features]
y_test = df.loc[df['partition'] == 'test', target]
X_prod = df.loc[df['partition'] == 'prod', features]
y_prod = df.loc[df['partition'] == 'prod', target]

So, after splitting, our final data distribution is:

  • X_train and y_train: data from 2010-02-12 to 2012-02-12 (4,725 data points)
  • X_test and y_test: data from 2012-02-12 to 2012-06-12 (945 data points)
  • X_prod and y_prod: data from 2012-06-12 to 2012-10-26 (675 data points)

Train machine learning model

We will now fit a LightGBM regressor to the training data, make predictions on both the training and test sets, and compute the mean absolute error (MAE) for both the model predictions and the baseline predictions.

from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np

# Fit the model
model = LGBMRegressor(random_state=111)
model.fit(X_train, y_train)

# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Make baseline predictions (always predict the training mean)
y_pred_train_baseline = np.ones_like(y_train) * y_train.mean()
y_pred_test_baseline = np.ones_like(y_test) * y_train.mean()

# Measure train, test and baseline performance
mae_train = mean_absolute_error(y_train, y_pred_train).round(4)
mae_test = mean_absolute_error(y_test, y_pred_test).round(4)
mae_train_baseline = mean_absolute_error(y_train, y_pred_train_baseline).round(4)
mae_test_baseline = mean_absolute_error(y_test, y_pred_test_baseline).round(4)

To evaluate the model, we compare its training and testing MAE (Mean Absolute Error) with that of a baseline model that always predicts the mean of the Weekly_Sales column.

We plotted two scatter plots of actual versus predicted values, one for the training data and one for the testing data. Both mean absolute errors are relatively low, meaning the model performs well enough for this use case.
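Those scatter plots aren’t reproduced in the text; a minimal sketch using matplotlib and the variables defined above looks like this:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Actual vs. predicted values on the training set
axes[0].scatter(y_train, y_pred_train, alpha=0.3)
axes[0].set_title(f'Train (MAE={mae_train}, baseline MAE={mae_train_baseline})')
axes[0].set_xlabel('Actual Weekly_Sales')
axes[0].set_ylabel('Predicted Weekly_Sales')

# Actual vs. predicted values on the test set
axes[1].scatter(y_test, y_pred_test, alpha=0.3)
axes[1].set_title(f'Test (MAE={mae_test}, baseline MAE={mae_test_baseline})')
axes[1].set_xlabel('Actual Weekly_Sales')
axes[1].set_ylabel('Predicted Weekly_Sales')

plt.show()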

We also calculated the feature importances, revealing that the top three features are CPI (Consumer Price Index), Unemployment, and Fuel_Price.
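One quick way to pull those importances out of the fitted LightGBM model (the article only shows the resulting chart, so this is a sketch):

import pandas as pd

# Default LightGBM importances (number of splits per feature)
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False).head(3))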

Estimating Performance in nannyML

NannyML provides two major algorithms for estimating the performance of regression and classification models: Direct Loss Estimation (DLE) for regression and Confidence-Based Performance Estimation (CBPE) for classification.

We will use the Direct Loss Estimation (DLE) algorithm for our regression task. DLE can estimate the performance of a production model without ground truth and report various regression metrics such as RMSE, RMSLE, and MAE.

y_pred_prod = model.predict(X_prod)  # predict the weekly sales for the production data

reference_df = X_test.copy()           # using the test set as the reference set
reference_df['y_pred'] = y_pred_test   # reference predictions
reference_df['Weekly_Sales'] = y_test  # ground truth (correct targets)
reference_df = reference_df.join(df['Date'])  # date

analysis_df = X_prod.copy()          # features
analysis_df['y_pred'] = y_pred_prod  # prod predictions
analysis_df = analysis_df.join(df['Date'])  # date

To use DLE, we first fit it on the reference set to establish a baseline, and then estimate performance on the analysis (production) set.

import nannyml as nml

dle = nml.DLE(
    metrics=['mae'],
    y_true='Weekly_Sales',
    y_pred='y_pred',
    feature_column_names=features,
    timestamp_column_name='Date',
    chunk_period='w'  # weekly chunks
)

dle.fit(reference_df)  # fit on the reference (test) data
estimated_performance = dle.estimate(analysis_df)  # estimate on the prod data
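To view the chart discussed below, the result object can be plotted directly (its plot() method returns a Plotly figure):

# Visualize the estimated MAE per weekly chunk, with thresholds and alerts
estimated_performance.plot().show()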

We observe that no performance issues were detected; the estimated performance stays within the threshold values.

Mr. Danny doesn’t have ground truth data for the future Weekly_Sales values, yet he's receiving a comprehensive model performance report without experiencing false alarms.

After some days, when Mr. Danny gains access to the ground truth Weekly_Sales values (i.e., targets become available), we can calculate the actual model performance on production data, also known as realized performance. In the cell below, we calculate the realized performance and compare it with nannyML's estimation.

calculator = nml.PerformanceCalculator(
    problem_type='regression',
    y_true='Weekly_Sales',
    y_pred='y_pred',
    metrics=['mae'],
    timestamp_column_name='Date',
    chunk_period='w'
)

calculator.fit(reference_df)
realized_performance = calculator.calculate(analysis_df.assign(Weekly_Sales=y_prod))
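The plot referenced below compares the estimate with the realized values; as a minimal equivalent, the realized result can be plotted on its own for the production chunks:

# Plot the realized MAE on the production (analysis) chunks
realized_performance.filter(period='analysis').plot().show()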

  • In the above plot, the estimated performance closely aligns with the realized performance, indicating that DLE’s estimate was accurate.

What leads to false alarms?

[Part 1 recap] After deploying his ML model, Mr. Danny received numerous alerts related to model performance. When he later received the ground truth values (the actual Weekly_Sales of his store for those weeks), he discovered that most of the alerts were false: more than 90% of the alarms were false, and only 10% correctly indicated a decline in model performance. As a result, putting data drift at the center of the monitoring strategy proved unsuccessful.

We will now examine univariate and multivariate drift on this data and explain why this approach failed in Mr. Danny’s case.

drdc = nml.DataReconstructionDriftCalculator(
    column_names=features,
    timestamp_column_name='Date',
    chunk_period='d',  # daily chunks
)

drdc.fit(reference_df)
multivariate_data_drift = drdc.calculate(analysis_df)
multivariate_data_drift.plot()

The multivariate drift method raised an alert for the analysis data. This is a false alert, because the drift did not affect model performance.

udc = nml.UnivariateDriftCalculator(
    column_names=features,
    timestamp_column_name='Date',
    chunk_period='w',  # weekly chunks
)

udc.fit(reference_df)
univariate_data_drift = udc.calculate(analysis_df)
univariate_data_drift.filter(
    period='all',
    metrics='jensen_shannon',
    column_names=['Unemployment']
).plot(kind='distribution')

Similarly, the univariate drift method raised an alert for the analysis data. This is also a false alert, since the drift did not affect model performance. So we saw that the false alarms occurred because data drift detection was placed at the center of the monitoring solution.

Data drift detection is still valuable, but on its own it does not reflect the true performance of an ML model in production. It is better used in the later stages of the monitoring process, such as root cause analysis, where it serves as a tool to identify and explain the factors impacting model performance.
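As a rough illustration of that ordering, one could inspect drift only for the chunks where the performance estimate raised an alert. This is a sketch, and the multi-index column names (('mae', 'alert'), ('chunk', 'key')) assume the layout of nannyML result DataFrames at the time of writing:

# Convert the DLE results to a DataFrame and keep only the alerting chunks
est_df = estimated_performance.filter(period='analysis').to_df()
alerting_chunks = est_df.loc[est_df[('mae', 'alert')], ('chunk', 'key')]

if alerting_chunks.empty:
    print('No estimated performance drop - no need to inspect drift.')
else:
    # Only now bring in the drift results, as a root-cause-analysis tool
    print('Investigate drift for chunks:', list(alerting_chunks))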

Summary

We explored how nannyML can be applied to real-world Walmart data for retail sales forecasting, overcoming challenges of changing consumer behavior and market trends.

We learned the importance of model monitoring, and also how false alarms can hinder effective monitoring.

If you have read this far, I strongly recommend exploring NannyML’s documentation to see how it can help in multiple use cases. You can also visit their website for more information.

References

nannyML Documentation: https://nannyml.readthedocs.io/

nannyML Blogs:

  1. Don’t be Fooled by Data Drift | ML Monitoring Tools (nannyml.com)
  2. ML Monitoring Framework: Model Performance at the Core (nannyml.com)
  3. Monitoring Strategies for Demand Forecasting Machine Learning Models (nannyml.com)

Dataset: Walmart Dataset (kaggle.com)

If you liked this article, do give it some claps 👏. If you have any doubts, do ask in the comments section.

Also, don’t forget to follow me for more articles on topics related to machine learning and data science.

Thank you for reading 😊
