SHAPing up your understanding of Black Box Models

Harshita Sharma
Published in Accredian
Apr 17, 2023

Exploring Model transparency using SHAP to rev up your rental bike business

Introduction

In the world of artificial intelligence, black boxes are no longer just for airplanes. As we continue to integrate AI into our daily lives, it is becoming increasingly important to understand how these algorithms make decisions. That’s where explainable AI comes in.

In my previous articles, we discussed explainable AI (XAI) in some depth.

Today we will look at another of these techniques. With the help of SHAP, we will peek inside a black-box model built to predict demand for a rental bike business and uncover the features driving its decisions.

SHAP (SHapley Additive exPlanations)

SHAP is a popular XAI technique for understanding and interpreting the output of machine learning models. It provides a way to measure the contribution of each feature to the final prediction made by the model. SHAP values are based on the Shapley value, a concept from cooperative game theory that measures the contribution of each player in a cooperative game. In the context of machine learning, SHAP values measure the contribution of each feature in the prediction of the target variable.
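Concretely, the Shapley value of a feature is its average marginal contribution over all possible subsets of the other features. In the standard game-theoretic notation (added here for reference, not taken from the original article), for a model f, feature set F, and feature i:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\,\bigl[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \bigr]

where f_S denotes the model's prediction using only the features in S. SHAP computes these values efficiently for tree-based models, which is what we will do below with TreeExplainer.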


Peeking inside the Black Box

The dataset used in this study comes from Kaggle and involves predicting the count of rental bikes in use.

After performing a fair amount of EDA and feature engineering, we come to model selection. Since this is a regression problem, I tried several models and found that a random forest performed best on this dataset, so let's take a closer look at it. (A rough sketch of how such a comparison can be run is shown below.)
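The original article does not include the comparison code; the following is only a minimal sketch of how candidate regressors might be compared with cross-validation, assuming X_train and y_train already exist from the preprocessing step:

# sketch: compare a few candidate regressors with 5-fold cross-validation
# (X_train and y_train are assumed to come from the earlier preprocessing)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

candidates = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=42),
    "random forest": RandomForestRegressor(random_state=42),
    "gradient boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in candidates.items():
    # scikit-learn returns negative MSE, so flip the sign for readability
    scores = -cross_val_score(model, X_train, y_train,
                              scoring="neg_mean_squared_error", cv=5)
    print(f"{name}: mean CV MSE = {scores.mean():.2f}")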

Next, we find the best parameters for an optimal model using hyperparameter tuning:

# imports needed for the tuning function
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def tune(X_train, X_test, y_train, y_test):
    param_grid = {
        'n_estimators': [100, 200, 300, 400],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        # 1.0 means "use all features", replacing the 'auto' option removed in newer scikit-learn
        'max_features': [1.0, 'sqrt', 'log2']
    }

    # create a random forest regressor object
    rf_reg = RandomForestRegressor()

    # create a grid search object
    grid_search = GridSearchCV(estimator=rf_reg, param_grid=param_grid, cv=5, n_jobs=-1)

    # fit the grid search object to the training data
    grid_search.fit(X_train, y_train)

    # print the best parameters and best score
    print("Best parameters:", grid_search.best_params_)
    print("Best score:", grid_search.best_score_)

    # use the best estimator to make predictions on the test data
    y_pred = grid_search.best_estimator_.predict(X_test)

    # calculate the mean squared error on the test data
    mse = mean_squared_error(y_test, y_pred)
    print("Mean squared error:", mse)


# finding the best parameters
tune(X_train, X_test, y_train, y_test)

Interpretation using SHAP:

#!pip install shap

# importing the module
import shap

# train a random forest regressor with the best parameters found above
# (max_features=1.0 replaces the deprecated 'auto', i.e. use all features)
rf = RandomForestRegressor(max_depth=20, max_features=1.0, min_samples_leaf=1,
                           min_samples_split=2, n_estimators=400)
rf.fit(X_train, y_train)

# feature names
names = ['year', 'hour', 'season', 'holiday', 'workingday', 'weather', 'temp',
         'atemp', 'humidity', 'windspeed']

# create the explainer object
explainer = shap.TreeExplainer(rf)

# generate SHAP values (using the first 20 test rows)
shap_values = explainer.shap_values(X_test[:20])

# generate the summary plot
shap.summary_plot(shap_values, X_test[:20], feature_names=names)

Summary Plot:

From the above plot it is clear which feature affects the output the most: in this case, hour. Each point's horizontal position shows that feature's SHAP contribution to a single prediction, while the colour indicates the feature's value (red for high, blue for low).

We can also change the plot type to examine the data from a different angle, for example as a bar chart of the mean absolute SHAP value per feature:

shap.summary_plot(shap_values, X_test[:20], feature_names=names, plot_type='bar')

Force Plots:

For a more local explanation of a particular prediction, SHAP gives us the option of a force plot. It produces an interactive graph that shows how each feature value pushes the prediction above or below the model's expected (base) value.

#this plots a force plot of first 20 values
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[:20], X_test[:20], feature_names=names)

A force plot can also be used for a single prediction:

shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[9], X_test[9], feature_names=names)

The force plot lets us explore the model's behaviour for individual predictions, providing insight into how each feature contributes to the output and how changes in feature values would shift it.

The plot is laid out along a horizontal axis representing the model output. Features that push the prediction higher than the base value appear in red, features that push it lower appear in blue, and the length of each segment reflects the size of that feature's contribution.

Dependence Plots:

By analyzing a dependence plot, one can understand how the model's output changes with the value of a chosen feature.

The x-axis represents the values of the feature of interest, and the y-axis represents the SHAP values for that feature. Each point on the plot represents an instance in the dataset, and the color of the point represents the value of another feature, which can be used to highlight or cluster different groups of instances.

# plot the dependence plot for the "temp" feature
shap.dependence_plot("temp", shap_values, X_test[:20], feature_names=names)

Taken together, these plots indicate that bike rental businesses could consider offering discounts during off-peak hours to balance demand and supply, and should pay close attention to temperature fluctuations when planning bike availability.

There are many other SHAP plots that can be used for specific purposes to enhance the analysis and solve the problem at hand; you can explore them in the official SHAP documentation. One of them, the decision plot, is sketched below.
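As an illustration (not part of the original walkthrough), a decision plot traces how each prediction accumulates from the expected value as the feature contributions are added; it reuses the explainer, shap_values, and names defined above:

# decision plot: shows how each of the 20 predictions builds up from the
# expected (base) value as feature contributions are added one by one
shap.decision_plot(explainer.expected_value, shap_values, X_test[:20], feature_names=names)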

Conclusion

In conclusion, hour turned out to be the most important feature in the dataset, so scheduling the rental bike business around it is likely to have the largest impact on outcomes.

Understanding these feature importances and their effects gives rental bike companies the insight they need to make more informed decisions about pricing, marketing, and supply chain management.

Final Thoughts and Closing Comments

While trying to SHAPe up our understanding of the black box, we can see why interpretability and explainability play such a big role in machine learning.

At this point it’s safe to say that SHAP is not just another funky-sounding acronym in the world of AI but a powerful tool that allows us to peek under the hood of complex models.

By SHAPing up our understanding of our models, we can avoid the pitfalls of blindly trusting them and make sure they work for us, not against us. So go ahead, add SHAP to your AI toolkit and start unraveling the mysteries of your models. Trust me, your data will thank you.
