Shapley Values and Pandas DataFrames

Justin Swansburg
3 min read · Apr 7, 2023


Better interpret and explain your ML predictions with this trick

(Image credit: Starryai.com)

Explaining and interpreting the output of ML models is crucial for building trust and informing better decision making. While several techniques exist to explain model predictions, one of the most popular is the use of Shapley values. Originating from cooperative game theory, Shapley values have been repurposed by data scientists to measure the contribution of each feature to a model’s overall prediction.

Despite their usefulness, Shapley values can be challenging to visualize and interpret, particularly when dealing with a large number of features and predictions. I’m going to show you a clever way to visualize Shapley values in a tabular format, making it easier to communicate the driving factors behind your model predictions and to spot trends across observations.

The idea behind this visualization technique is to present model predictions in a table alongside their input features, with table cells highlighted based on the strength of the Shapley value. This approach allows you to quickly scan across predictions and visually grasp which features are pushing predictions up or down. Here’s what the final result looks like:

DataFrame with Shapley value highlighting

Check out this quick clip to see the table live. To implement this approach, follow these steps:

  1. Create a table with input features and predictions: Begin by constructing a pandas DataFrame with one column per input feature and one row per prediction (i.e., per scored observation).
  2. Compute Shapley values: Utilize a library or tool that supports Shapley value computation, such as the SHAP library in Python, to calculate the Shapley values for each feature and prediction. These values will be used to color the table cells in the next step.
  3. Highlight table cells based on Shapley values: Assign a color gradient to the table cells based on the corresponding Shapley values. For instance, you can use a diverging color scale with red representing positive Shapley values (i.e., features that increase the prediction) and blue for negative values (i.e., features that decrease the prediction). The intensity of the color should reflect the magnitude of the Shapley value, with stronger colors indicating a more significant impact on the prediction.

Let’s take a closer look at a few values:

This particular example was built from a churn model that I discuss in this post.

Looking at the rightmost column, we can see that customers with lower product usage (in this case, the number of unique users in the past month) are highlighted in red, whereas customers with higher product usage are highlighted in blue. This highlighting tells us that our model outputs a higher probability of churn for customers with fewer users. Makes sense!

Another major benefit of visualizing Shapley values across a table is the ability to detect feature interactions. Interactions play a critical role in the performance and interpretability of machine learning models. In the context of predictive modeling, an interaction occurs when the combined effect of two or more features on the target variable differs from the sum of their individual effects.

If you take another look at the table, you may notice that the highlighting strength differs across customers who purchased the same starter pack. Some cells are highlighted red while others have little or no highlighting at all. This difference is an indication that the model has captured an interaction between that feature and one or more other features.
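One simple way to quantify that signal (my own illustration, not necessarily the author's method): if a feature contributed purely additively, every row sharing the same feature value would receive roughly the same Shapley value, so a large spread within a group suggests an interaction. The column names and numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical rows: the same `starter_pack` value paired with the
# Shapley value that feature received on each row.
df = pd.DataFrame(
    {
        "starter_pack": ["basic", "basic", "basic", "pro", "pro"],
        "shap_starter_pack": [0.20, 0.02, -0.05, -0.10, -0.12],
    }
)

# A purely additive feature would get (roughly) one Shapley value per
# feature value; a large within-group spread hints at an interaction.
spread = df.groupby("starter_pack")["shap_starter_pack"].std()
```

Here the "basic" group's spread is much larger than the "pro" group's, flagging "basic" purchasers as the place to look for an interacting feature.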

I’ll show you how to dig in and uncover these interactions in my next post. In the meantime, go here to see the code and follow me on Medium and LinkedIn for more helpful data science tips. Thanks for reading!


Justin Swansburg

Data scientist and AI/ML leader | VP, Applied AI @ DataRobot.