Failure Is Not an Option: How to Prevent Your ML Model From Degradation

Maciej Balawejder
Published in NannyML
Jan 12, 2023 · 9 min read

Reliance on Machine Learning models is growing. Business decisions in banking, insurance, and real-estate organizations are increasingly driven by intelligent systems. Automation brings increased profit but also carries a great risk of failure. In this blog post, we will try to understand Zillow’s $384 million fiasco and how NannyML software can help prevent it.

Photo by DeepMind on Unsplash.

In the third quarter of 2021, Zestimate (Zillow’s ML model) started selling houses for much less than they had been bought for. The flip became a flop, and Zillow reported a loss of $384 million and laid off 25% of its workforce.

The damage is done; now it’s time to analyze the initial causes to prevent it from happening again.

What could possibly go wrong?

Fundamentally, we have been unable to predict future pricing of homes to a level of accuracy that makes this a safe business to be in. ~ Rich Barton, CEO of Zillow

To date, Zillow hasn’t publicly shared any insights into what failed. There are plenty of reasons why the machine learning algorithm could have failed to predict prices accurately. Besides possible data quality issues, the main lesson is the danger of relying heavily on the algorithm itself.

As businesses become increasingly driven by ML models, our tooling should also be ready and up-to-date to prevent potential problems.

Zillow lost millions of dollars, but what if the stakes were higher? Like a model that hands down jail sentences or diagnoses a disease?

In the next section, we will learn how careful monitoring can help us prevent potential failures.

The importance of ML Monitoring

The real fun starts when the model is deployed.

One of the keys to a successful ML deployment is performance monitoring. Besides preventing failures, it also helps maintain the model’s business value.

In many companies, it’s a common practice and a relatively straightforward task. The problem arises when the ground truth is not directly available.

Let’s go back to the house price prediction. It can take months before a house gets sold and the real price is revealed. Only then can we assess our predictions and evaluate our model. In the meantime, we can only hope that our model’s performance won’t go off the rails.

What can we do about it?

The missing piece in that workflow is an estimation of the model’s performance. Instead of sitting and waiting for target values to arrive, we could perform an initial analysis and address forthcoming issues early. Model performance estimation sounds like a tricky idea, but it is possible to achieve. In the next section, we will go through this concept in detail, along with a hands-on example using NannyML.

What is NannyML?

The typical post-development ML workflow with NannyML’s operations.

NannyML is an open-source Python library for post-deployment data science. This covers all the work after our model reaches the production stage, like monitoring its performance and business impact, and maintaining it.

Let’s kick off with the basics! A simple setup requires only running the following pip install command, and voilà.

$ pip install nannyml

To simulate the real-estate use case, we will analyze a slightly modified California Housing dataset. We will try to predict whether a house’s value is above the median. Since our target value is now binary, we are dealing with a classification problem.

Data Requirements

Data generation process.

First of all, let’s clarify the types of datasets we are using in a Machine Learning project:

  • Training data — a historical set of data used to train the algorithm
  • Testing data — a historical, unseen set of data used to evaluate the performance of the model
  • Production data — real-time incoming data without the ground truth

NannyML uses two slightly modified versions of these subsets:

  • Reference set — testing data with the model’s predictions and targets
  • Analysis set — production data with the model’s predictions (no targets yet)

Alrighty, let’s kick off with loading the data!

import pandas as pd
import numpy as np
import datetime as dt
import nannyml as nml
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

cali = fetch_california_housing(as_frame=True)
df = pd.concat([cali.data, cali.target], axis=1)
df.head(3)
First three rows from the dataset.

Now, we need to enrich our dataset by adding a timestamp column. This simulates the real-world scenario where older data is used for training and testing, while the newest examples are collected during production.

timestamps = [dt.datetime(2020,1,1) + dt.timedelta(hours=x/2) for x in df.index]
df['timestamp'] = timestamps

# add periods/partitions
train_beg = dt.datetime(2020,1,1)
train_end = dt.datetime(2020,5,1)
test_beg = dt.datetime(2020,5,1)
test_end = dt.datetime(2020,9,1)
df.loc[df['timestamp'].between(train_beg, train_end, inclusive='left'), 'partition'] = 'train'
df.loc[df['timestamp'].between(test_beg, test_end, inclusive='left'), 'partition'] = 'test'
df['partition'] = df['partition'].fillna('production')
# create new classification target - is the house value higher than the training-set median?
df_train = df[df['partition']=='train']
df['clf_target'] = np.where(df['MedHouseVal'] > df_train['MedHouseVal'].median(), 1, 0)
df = df.drop('MedHouseVal', axis=1)
del df_train
target = 'clf_target'
meta = 'partition'
features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

Let’s also train a simple RandomForestClassifier to predict whether the house price is above the median or not.

df_train = df[df['partition']=='train']
clf = RandomForestClassifier(random_state=42)
clf.fit(df_train[features], df_train[target])
df['y_pred_proba'] = clf.predict_proba(df[features])[:,1]
# turn probabilities into class predictions using a 0.8 decision threshold
df['y_pred'] = df['y_pred_proba'].map(lambda p: int(p >= 0.8))
# Check roc auc score
for partition_name, partition_data in df.groupby('partition', sort=False):
    print(partition_name, roc_auc_score(partition_data[target], partition_data['y_pred_proba']))
# train 1.0
# test 0.9507469810165158
# production 0.900980984424309

Finally, we can split our data into reference and analysis sets.

# create reference and analysis set
df_for_nanny = df[df['partition']!='train'].reset_index(drop=True)
df_for_nanny['partition'] = df_for_nanny['partition'].map({'test':'reference', 'production':'analysis'})
df_for_nanny['identifier'] = df_for_nanny.index

reference = df_for_nanny[df_for_nanny['partition']=='reference'].copy()
analysis = df_for_nanny[df_for_nanny['partition']=='analysis'].copy()
analysis_target = analysis[['identifier', 'clf_target']].copy()
analysis = analysis.drop('clf_target', axis=1)
# drop the partition column, which is no longer needed
reference.drop('partition', axis=1, inplace=True)
analysis.drop('partition', axis=1, inplace=True)

Performance Estimation

Performance estimation is a key part of the ML monitoring process. NannyML offers two estimation algorithms, and the choice between them depends on the type of target values.

CBPE (Confidence-Based Performance Estimation) is the flagship algorithm, tailored to classification tasks. The idea behind it is to estimate the elements of the confusion matrix using the expected error rates derived from the predicted probabilities. This slick solution lets us estimate any performance metric.

CBPE Estimator workflow.

If you want to dive deeper into the technical aspects of this method, I highly recommend checking out the docs.
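To build some intuition, here is a simplified sketch of the idea, assuming well-calibrated probabilities. The estimate_precision_recall helper is hypothetical and is not NannyML’s implementation; it only shows how expected confusion-matrix elements fall out of the predicted probabilities.

# Simplified illustration of the CBPE idea, not NannyML's implementation:
# with calibrated probabilities, every prediction contributes its expected
# share to the confusion matrix, so metrics can be estimated without labels.
def estimate_precision_recall(y_pred_proba, y_pred):
    p = np.asarray(y_pred_proba, dtype=float)
    pred = np.asarray(y_pred)
    exp_tp = p[pred == 1].sum()        # expected true positives
    exp_fp = (1 - p[pred == 1]).sum()  # expected false positives
    exp_fn = p[pred == 0].sum()        # expected false negatives
    precision = exp_tp / (exp_tp + exp_fp)
    recall = exp_tp / (exp_tp + exp_fn)
    return precision, recall

estimate_precision_recall(analysis['y_pred_proba'], analysis['y_pred'])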

If you don’t like boilerplate code, NannyML will be your biggest ally. With just a few lines of code, we can get the estimated performance of the model.

cbpe = nml.CBPE(
    y_pred='y_pred',
    y_pred_proba='y_pred_proba',
    y_true='clf_target',
    timestamp_column_name='timestamp',
    problem_type='classification_binary',
    chunk_period='M',
    metrics=['roc_auc'])
cbpe.fit(reference_data=reference)
est_perf = cbpe.estimate(analysis)

Before we visualize the results, let’s elaborate on the chunk_period argument. A chunk is essentially a single data point in the monitoring results. The value 'M' denotes a monthly split of our dataset. Check out the docs to learn more about chunking.
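Calendar periods are not the only option; NannyML also supports splitting the data into chunks of a fixed size or into a fixed number of chunks. A short sketch of the size-based variant (the chunk_size value of 2000 is just an arbitrary choice for illustration):

# size-based chunking: every chunk holds a fixed number of observations
cbpe_sized = nml.CBPE(
    y_pred='y_pred',
    y_pred_proba='y_pred_proba',
    y_true='clf_target',
    timestamp_column_name='timestamp',
    problem_type='classification_binary',
    chunk_size=2000,
    metrics=['roc_auc'])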

est_perf.plot()
Estimated performance graph.

As we can see, the estimated performance starts to drop at the beginning of 2021. This is an important signal to take a closer look at the input data.
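Once the delayed ground truth eventually arrives, we can also check how accurate the estimation was by computing the realized performance. A minimal sketch, assuming we join the analysis_target frame we set aside earlier back onto the analysis set:

# join the delayed targets back onto the analysis set
analysis_with_targets = analysis.merge(analysis_target, on='identifier')

perf_calc = nml.PerformanceCalculator(
    y_pred='y_pred',
    y_pred_proba='y_pred_proba',
    y_true='clf_target',
    timestamp_column_name='timestamp',
    problem_type='classification_binary',
    chunk_period='M',
    metrics=['roc_auc'])
perf_calc.fit(reference)
realized_perf = perf_calc.calculate(analysis_with_targets)
realized_perf.plot()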

Data Drift

Visualization of target values and model’s decision boundary for silent model failures.

We already established that our model is degrading, and now we need to find out why. The input data is a good starting point to look at.

The distribution of the features changes over time in unforeseen ways as the world around them changes. This is data drift, one of the two possible silent model failures. Significant shifts in the distribution can make our decision boundary outdated, so it is important to keep an eye on distribution changes between training and production data.

For that purpose, NannyML equips us with two detection tools:

Multivariate method

The multivariate drift detection method.

This novel algorithm, developed inside NannyML’s labs, gives us a general overview of data drift. First, the dataset is compressed to a latent space using PCA, then reconstructed back to its original dimensions with a certain error. This reconstruction error is used as a baseline for data drift detection: an increase or decrease in the error signifies a change in the data distribution.

This simple solution allows us to look at the dataset as a whole and quickly detect if the problem lies in the data or somewhere else.
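To build intuition for why reconstruction error reacts to drift, here is a minimal sketch with scikit-learn. It is not NannyML’s exact algorithm, only the underlying idea: fit PCA on the reference features, then measure how well new data is reconstructed (the choice of 5 components is arbitrary here).

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(reference[features])
pca = PCA(n_components=5).fit(scaler.transform(reference[features]))

def reconstruction_error(data):
    X = scaler.transform(data[features])
    X_hat = pca.inverse_transform(pca.transform(X))
    # mean euclidean distance between original and reconstructed rows
    return np.sqrt(((X - X_hat) ** 2).sum(axis=1)).mean()

print(reconstruction_error(reference), reconstruction_error(analysis))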

Let’s see how we can perform it on our dataset.

# the eight input features plus the model outputs, skipping the timestamp column
feature_column_names = analysis.columns[:8].append(analysis.columns[9:11])
calc = nml.DataReconstructionDriftCalculator(
    column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_period='M'
)

calc.fit(reference)
results = calc.calculate(analysis)
results.plot()
Multivariate drift (PCA reconstruction error) graph.

As we can see, the alert signals on the multivariate drift graph overlap with the drop in the estimated performance of the model. Now, we need to dig deeper to find the reasons behind it.

Univariate method

The univariate method calculates the change in the distribution of each feature separately. NannyML supports methods for both continuous and categorical variables. The main drawback of this approach is its insensitivity to changes in correlations between features (think of the butterfly dataset), which isn’t the case for the multivariate approach. This is why it is worth applying both methodologies.
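For a single continuous feature, the check boils down to comparing the reference and analysis distributions, for example with a two-sample Kolmogorov-Smirnov test. A quick sketch of that intuition with scipy; NannyML runs this kind of test per feature and per chunk for us:

from scipy.stats import ks_2samp

# compare the MedInc distribution between reference and analysis data
stat, p_value = ks_2samp(reference['MedInc'], analysis['MedInc'])
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3g}")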

univariate_calculator = nml.UnivariateDriftCalculator(
    column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_period='M',
    continuous_methods=['kolmogorov_smirnov'],
    categorical_methods=['chi2']).fit(reference_data=reference)
univariate_results = univariate_calculator.calculate(analysis)

NannyML provides us with alert ranking to quickly identify the features that drift the most.

nml.AlertCountRanker().rank(univariate_results)
Alert ranking based on the number of significant drifts for each feature.

We can see that the HouseAge, Population, and MedInc features are at the top, so we will take a closer look at them.

fig = univariate_results.filter(
    column_names=['HouseAge', 'Population', 'AveOccup', 'MedInc'],
    methods=['kolmogorov_smirnov']
).plot(kind='distribution', number_of_columns=1)
fig.show()
Distribution graphs for MedInc, HouseAge, and Population.
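Besides the distribution plots, we can also plot the drift statistic itself over time for a single feature, which makes it easier to pinpoint when the drift started. A short sketch, assuming the plot method accepts kind='drift' as in the version used here:

# KS statistic over time for the most drifting feature
univariate_results.filter(
    column_names=['MedInc'],
    methods=['kolmogorov_smirnov']
).plot(kind='drift').show()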

Looking at the distribution graphs, the MedInc and Population distributions are shifting and becoming flatter, indicating more variance in the data. On the other hand, the HouseAge distribution has a smaller variance, and its mean moves towards lower values.

There are progressive shifts in the input data. The next step is to reason about why they occur.

Manual Analysis

The data observations above are the reflection of what’s going on in the real world. Now, we need to put our data scientist’s hat on and reason about the root causes of these problems.

  • Why did the model performance decrease in those time brackets?
  • Are data drifts forming a pattern?
  • Why does the distribution of those features change significantly?

A possible explanation is gentrification, a process in which educated, wealthier individuals move into poor or working-class communities. The rising cost of living and new houses disrupt the local real-estate market. The drifts in the data are increasing over time, signifying that it’s an ongoing process. The median income is becoming more diverse, which implies that wealthier people are moving into the area. Another piece of evidence is the increasing number of new houses and the growing population around them. Based on this information, we could start gathering more data and retraining the model to adapt it to the new environment.

Obviously, a real-world investigation is much more complex, but I hope I painted a good picture of how important and how simple it is to incorporate NannyML into the analysis process.

Conclusions

If you’ve made it this far, you know that with NannyML, model failures can be caught early. I hope you now understand that careful monitoring and performance estimation are the key to a successful deployment. You are also equipped with the basic knowledge to start using the NannyML toolbox in your own project. If you are eager to learn more, check out the docs and other blogs!

Happy monitoring!

Originally published at https://www.nannyml.com.
