Understand Your Black Box Model Using Sensitivity Analysis — Practical Guide

How each feature affects my model’s predictions

Einat Naaman
6 min read · Jan 28, 2019

Interpretability of models has become an extremely popular topic in recent years, and there is a great deal of research in the field. Data scientists today are required not only to create a model with excellent performance, but also to explain certain aspects of the model. Understanding the model is crucial to validate its correctness, detect bias or leakage, and even learn new patterns in the data. This task can sometimes be very complicated, as a complex model is often required to get state-of-the-art performance, and this usually comes at the price of interpretability.

https://pixabay.com/en/view-eyes-by-looking-woman-frame-1782619/

Interpretability can be categorized into two types: global interpretability, which gives explanations about the behavior of the model over the entire population, and local interpretability, which gives explanations regarding a specific prediction.

In this post I will present a technique for global interpretability of black box models — feature sensitivity. What I love about this method is that it uses only the model’s prediction function, and thus can be used to interpret practically any model. I will explain the technique and how to implement it in production.

Sensitivity analysis

A simple yet powerful way to understand a machine learning model is sensitivity analysis, where we examine what impact each feature has on the model’s predictions. To calculate feature sensitivity we change the feature’s value, or try to ignore it somehow, while all the other features stay constant, and observe the output of the model. If changing the feature’s value drastically alters the model’s outcome, the feature has a big impact on the prediction.

Formally, given a test set X, we would like to measure the sensitivity of feature i. We create a new set X* by applying a transformation T to feature i. We perform prediction on X and denote the prediction vector Y, and perform prediction on X* and denote the prediction vector Y*. To measure the change in the outcome we use our score metric while treating Y as the true y. Let S be the original score, the score of the model on X (for accuracy, for example, this will be 1), and S* be the new score, the score after changing the feature’s value. The sensitivity of feature i is then S - S*.

Sensitivity analysis calculation process for feature i
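To make the procedure concrete, here is a minimal sketch of the calculation, assuming a scikit-learn-style model with a predict method, a numpy feature matrix, and a score function such as accuracy_score; the function name feature_sensitivity is mine, not part of any existing library:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def feature_sensitivity(model, X, feature_idx, transform, score=accuracy_score):
    """Return S - S* for one feature under a given transformation T."""
    y_pred = model.predict(X)            # Y: predictions on the original test set
    s_original = score(y_pred, y_pred)   # S: score of the model on X (1.0 for accuracy)

    X_new = X.copy()
    X_new[:, feature_idx] = transform(X[:, feature_idx])  # apply T to feature i only
    y_pred_new = model.predict(X_new)    # Y*: predictions on the transformed set
    s_new = score(y_pred, y_pred_new)    # S*: measured against Y as the "true" labels

    return s_original - s_new
```

For example, feature_sensitivity(model, X_test, 3, np.random.permutation) would measure the sensitivity of the fourth feature under the permutation transformation described below.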

Note that there are methods for feature importance, such as Correlation Feature Selection and Mutual Information, that use a mathematical calculation to get the importance of a feature. Those methods can be used to select features before running the model, but they don’t use the model itself in their calculation. Other methods for calculating feature importance are applicable only to a specific estimator, such as Random Forest feature importance. Such methods can’t necessarily be used for black box models.

Which transformation T should I use?

We would like to measure the change in the prediction after changing the feature value; however, different transformations result in different changes. I will describe three transformations, each with its own advantages:

Uniform distribution — replace the feature value with another one drawn from the possible feature values with uniform probability. Notice that in this case the sensitivity measure is affected by all possible feature values equally. Let’s look at an example to illustrate an issue that should be considered when using this transformation: suppose we have a numerical feature, age, whose values range between 0 and 120, but most of the data consists of teenagers aged 16 to 18; changing the feature within this range doesn’t affect the prediction, but changing it to a value outside this range does. If we use the uniform distribution we will get high sensitivity for this feature, although most of the time this feature won’t affect the prediction.

Permutation — permute the feature values. By using permutation we use the real distribution of the feature values in the data, so the sensitivity measure will mostly be affected by values that appear more often in the data. The main advantage of doing so is that the result considers the actual population of the data. An issue that may occur here is that a skewed feature will get low sensitivity, even though changing the feature would actually affect the predictions.

Missing values — try to simulate that the feature doesn’t exist in the model. In models such as neural networks you can do this by inserting zeros. Alternatively, you can use the mean for a numerical feature, a new class for a categorical feature, the value with the highest probability, or any other way you use to impute your data.
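Here is a minimal sketch of the three transformations, assuming the feature column is a one-dimensional numpy array of a numerical feature (the mean imputation in the last variant is just one of the options mentioned above):

```python
import numpy as np

def uniform_transform(col):
    # Replace each value with one drawn uniformly from the observed feature values.
    return np.random.choice(np.unique(col), size=len(col))

def permutation_transform(col):
    # Shuffle the values, preserving the real distribution of the feature.
    return np.random.permutation(col)

def missing_values_transform(col):
    # Simulate a missing feature, here by imputing the mean of a numerical column.
    return np.full(len(col), col.mean())
```

Any of these can be passed as the transform argument of the feature_sensitivity sketch above.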

Production Considerations

Feature sensitivity analysis requires calculating many predictions. To be exact, n_samples x n_features predictions, where n_samples is the number of samples in our test set and n_features is the number of features. We can use batches to reduce this number, but there will still be many predictions to calculate, and many algorithms, such as Random Forest, take a long time to perform prediction. There are a couple of ways to overcome this issue:

  1. Subsampling — using a couple of thousand samples, with a simple splitting strategy such as stratified sampling, will mostly be sufficient.
  2. Parallelization — we can run predictions simultaneously using multiprocessing to increase the prediction rate. In production, we are often limited by the amount of RAM that can be used, and determining the maximum number of processes can be tricky in such cases. We can initially perform a few batch predictions serially, use them to approximate the memory needed for a single prediction, and then use as many workers as possible without breaking our memory limit.
  3. Multiple stages — finally, if we have a lot of features, we can further reduce the number of predictions by calculating feature sensitivity twice. The first time we use a small number of samples (up to a couple of hundred). This gives us a sensitivity measure for all features, but it is relatively inaccurate because we use only a few samples. We then filter the best features and recalculate the sensitivity analysis for them over the whole test set (or the subsampled set). This way we get a reliable sensitivity measure for the most important features, which is what we need; a sketch of this two-stage approach follows this list.
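A rough sketch of the two-stage approach, reusing the hypothetical feature_sensitivity helper from earlier; the subsample size and the number of features kept in the second stage are illustrative defaults, not values from the post:

```python
import numpy as np

def two_stage_sensitivity(model, X, transform, n_stage1=200, top_k=10):
    """Stage 1: rough estimate for all features on a small subsample.
    Stage 2: accurate estimate for the top_k features on the full set."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X), size=min(n_stage1, len(X)), replace=False)
    X_small = X[idx]

    # Stage 1: cheap but noisy sensitivity for every feature.
    rough = [feature_sensitivity(model, X_small, i, transform)
             for i in range(X.shape[1])]
    best = np.argsort(rough)[::-1][:top_k]

    # Stage 2: recompute only the most promising features on all samples.
    return {int(i): feature_sensitivity(model, X, i, transform) for i in best}
```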

Real-world example

I work as a data scientist at Firefly.ai, where we automatically build models that can consist of an ensemble of models, each with its own pipeline of imputation, feature engineering, selection, and an estimator, all chosen out of hundreds of algorithms. I came across the Costa Rican Household Poverty Level Prediction competition on Kaggle, where the goal is to predict the income level of households in Costa Rica. I ran this competition through our AutoML system and got a macro-recall score of 0.9. To understand what the model learned, let’s have a look at the sensitivity analysis graph (created using the permutation transformation) of the top-10 features. The sensitivity values are normalized to sum to 100. The graph shows that the two most important features for the model were SQBdependency (working-age population) and meaneduc (average years of education for adults). The importance of these features makes sense, but the magnitude of their importance relative to other features, such as the number of rooms, is quite surprising.

Costa Rican household poverty level sensitivity analysis graph
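As an aside, a graph like this can be produced from a dictionary of per-feature sensitivities with a few lines of matplotlib; the snippet below is a generic sketch and does not contain the competition’s actual values:

```python
import matplotlib.pyplot as plt

def plot_top_sensitivities(sensitivities, top_n=10):
    """Plot the top_n features with sensitivities normalized to sum to 100."""
    total = sum(sensitivities.values())
    normalized = {name: 100.0 * value / total for name, value in sensitivities.items()}
    top = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    names, values = zip(*top)

    plt.barh(names[::-1], values[::-1])  # largest bar at the top
    plt.xlabel("Normalized sensitivity")
    plt.title("Top-{} feature sensitivity".format(top_n))
    plt.tight_layout()
    plt.show()
```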

Now I can use the insights from the analysis to get an even better model. For example, I can remove features that have very little effect on the model, or remove features that affect the model but that I think may cause overfitting. If there is a leakage feature, it will show up as an important feature in the graph, so I’ll see it immediately. I can also investigate the patterns I saw further, for example how SQBdependency affects the model, meaning in which ranges of the working-age population the model predicts high income, and so on.

Last words

Feature sensitivity is a very easy-to-use and intuitive technique for understanding which features affect the model the most. There are more advanced methods for global interpretability, such as PDP (partial dependence plots), which also shows how the prediction changes as a feature’s value changes.

Other interpretation methods deal with local interpretability, namely understanding the prediction for a specific example. There is often a need to explain a particular instance, for example to understand why the model predicted that someone shouldn’t get a loan. LIME and SHAP address this issue.
