Interpretable Machine Learning : An attempt to demystify the black-box

pavitra srinivasan
Walmart Global Tech Blog
May 6, 2019

Accuracy vs Interpretability paradox

Data scientists often have to deal with the classic accuracy vs. interpretability paradox when choosing a predictive modeling technique. Traditional regression techniques such as linear regression (fit via OLS, Ordinary Least Squares) and logistic regression provide coefficients that are easy to interpret and clearly explain the incremental impact of a predictor variable on the response variable (the variable you are trying to predict). In other words, it is easy to understand the expected impact a 1-unit change in an input will have on the response variable. These techniques typically perform well when the decision boundary is linear, but this is often not the case. In such scenarios, tree-based ensemble techniques such as random forest and XGBoost outperform traditional regression models in terms of predictive power. However, the major challenge with tree-based models is the difficulty of interpreting how a particular feature impacts individual predictions, at least directionally. It is for this reason that tree-based techniques, although widely used, are often considered a black box.
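As a minimal illustration of that coefficient-based interpretability (a sketch on made-up synthetic data, not the SEM model discussed later), a fitted linear model's coefficients read directly as the expected change in the response per 1-unit change in each input:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: response depends linearly on two inputs
rng = np.random.default_rng(0)
X = rng.random((500, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, 500)

model = LinearRegression().fit(X, y)
# Each coefficient is the expected change in y per 1-unit change in that input
print(model.coef_)       # roughly [ 3.0, -1.5]
print(model.intercept_)  # roughly 0.0
```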

What is SHAP and why is it useful?

The feature importance plots supported by tree-based models shed some light on the relative importance of the input features. Although these plots generalize well across the training data, they do not help understand the impact a feature has on a given prediction. This is where the true power of SHAP (SHapley Additive exPlanations) comes in handy. With the plots supported by the SHAP package, we can visualize individual predictions along with the impact of each input feature on that prediction. SHAP has its roots in game theory, where each prediction is treated as the outcome of a game played by the interacting features.

How does SHAP determine feature importance?

For a given prediction, each feature is assigned a score called the Shapley value, which quantifies the contribution of that feature to the specific prediction. The Shapley value of a feature is its average marginal contribution across all possible orderings (permutations) of the input features. In other words, for each permutation, predict the response with and without the feature whose score is being determined, and then average the observed differences to get the Shapley value. Also, for a given prediction, the Shapley values of all the input features must add up to the difference between the predicted value and the baseline value, i.e. the average prediction across the training dataset. For the mathematical details behind this calculation, please refer to this article.
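To make that definition concrete, here is a brute-force sketch of the permutation averaging (this is not how the SHAP package actually computes the values; its TreeExplainer uses a much faster tree-specific algorithm, and here a "missing" feature is simply approximated by the training-set mean):

```python
import itertools
import math
import numpy as np

def shapley_values(predict, x, background):
    """Brute-force Shapley values for a single row x.

    Features outside the coalition are approximated by the background
    (training-set) mean -- a simplification of the exact expectation
    that SHAP computes.
    """
    n = len(x)
    phi = np.zeros(n)
    baseline = background.mean(axis=0)
    for perm in itertools.permutations(range(n)):
        current = baseline.copy()                 # start with no features "revealed"
        prev = predict(current.reshape(1, -1))[0]
        for i in perm:
            current[i] = x[i]                     # reveal feature i
            new = predict(current.reshape(1, -1))[0]
            phi[i] += new - prev                  # marginal contribution of feature i
            prev = new
    return phi / math.factorial(n)                # average over all orderings

# Additivity property from the text:
# phi.sum() is (approximately) predict(x) - predict(baseline)
```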

A SEM (Search Engine Marketing) use case for SHAP:

A model trained using any tree-based algorithm can be fed as an input to the TreeExplainer in the SHAP package. Although SHAP can also be applied to deep learning techniques, the focus of this article is limited to the TreeExplainer. For the purpose of our analysis, a SEM ad conversion rate prediction model built with XGBoost was analyzed using plots from the SHAP package. Over 60 input features are used in this model. At a high level, these features can be broken into the following buckets (a setup sketch follows the list):

· Historical SEM performance based signals — sem orders, sem clicks, sem adspend

· Historical site performance signals — site orders, site revenue

· Item attribute related signals — item price, ratings
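A minimal sketch of that setup, using synthetic stand-in data (the column names below only mirror the real SEM features and are illustrative, not the actual 60+ inputs):

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

# Synthetic stand-in for the SEM training data
rng = np.random.default_rng(42)
cols = ["sem_convrt_w2", "sem_clicks_w2", "sem_orders_w1",
        "site_add_to_cart", "item_price"]
X_train = pd.DataFrame(rng.random((1000, len(cols))), columns=cols)
y_train = (0.6 * X_train["sem_convrt_w2"]
           + 0.2 * X_train["site_add_to_cart"]
           + rng.normal(0, 0.05, 1000))

model = xgb.XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X_train, y_train)

# TreeExplainer computes SHAP values for tree ensembles efficiently
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)   # one row of SHAP values per prediction
print(explainer.expected_value)                # the baseline (average model output)
```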

The standard feature importance plot from the XGBoost model is shown below. It gives the relative importance of features across the entire training dataset. In other words, from this plot we understand that “sem_convrt_w2” (historical SEM conversion in the prior 2 weeks) is a stronger predictor of the ad conversion rate than “sem_clicks_w2” (historical SEM clicks in the prior 2 weeks) in a majority of cases. While this plot helps identify the important features in a relative sense, it does not provide insight into the features that drive individual predictions.

Note: Weight refers to the number of times a particular feature is used for splitting the decision tree nodes
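A plot like the one above can be produced directly from the fitted model; the importance_type argument selects "weight" to match the note (a sketch, assuming the model from the earlier snippet):

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# "weight": number of times a feature is used to split a node
# (other available options are "gain" and "cover")
plot_importance(model, importance_type="weight", max_num_features=15)
plt.tight_layout()
plt.show()
```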

With the SHAP package applied to the ad conversion rate prediction model, it is possible to understand the features that drive individual predictions. For instance, consider the force plot below, used to visualize feature importance for 2 different scenarios (a minimal call is sketched after the list). In the force plot, the red and blue colors indicate the following:

· Red — Features that push the predicted value above the baseline (average)

· Blue — Features that pull down the predicted value below the baseline
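A minimal sketch of producing such a force plot for a single prediction, continuing from the earlier snippet (row 0 is just an arbitrary example):

```python
# Force plot for one prediction: features in red push the prediction
# above the baseline (expected_value), features in blue pull it below.
shap.initjs()  # loads the JavaScript needed to render the plot in a notebook
shap.force_plot(
    explainer.expected_value,
    shap_values[0, :],
    X_train.iloc[0, :],
)
```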

Scenario 1: Predicted ad conversion rate is less than the actual value. Hence the focus here is to identify features that pull down the prediction.

From the above plot, it can be seen that the model treats the absence of any “sem orders” (feature color coded in blue above) in the prior few weeks as more critical than the ad having received clicks over the same period, and this is what resulted in the lower predicted conversion rate.

Scenario 2: Predicted value is greater than the actual value. Hence the focus in this case is to identify features that pushed the predictions beyond the baseline.

As observed, the model considers higher “sem conversion” in the last 2 weeks (feature color coded in red above) as well as higher “sem orders” in the prior 1 week as the key factors that pushed the prediction beyond the baseline.

Notice how the same feature “sem_orders_w1” drives the predictions differently in the 2 scenarios above. The standard feature importance plot cannot get to this level of detail and does not provide much insight into how a particular prediction was arrived at. This is what makes ML algorithms a black box for the business teams that consume their output. With the help of packages such as SHAP and LIME, to name a few, data scientists can provide increased visibility into the inner workings of ML algorithms and thereby gain credibility with business partners. Further, it also serves as a tool data scientists can use to diagnose noisy predictions and fine-tune algorithm performance.

Besides visualizing individual predictions, SHAP also supports dependence plots, which show how a particular feature interacts with another feature and how the two together impact the output metric. The plot below illustrates how the “site add to cart page view” feature impacts the predicted ad conversion rate. As expected, the chance of conversion increases with more add-to-cart page views. Additionally, the color coding captures the interaction between the “site add to cart page view” and “item price” features: blue indicates records with a low price, while red indicates records with a high price. Items with a high price tend to have fewer add-to-cart page views than items with a low price, which in turn indicates a lower chance of conversion, demonstrating the price-sensitive nature of our shoppers. Likewise, it is possible to choose any 2 input features and study their combined effect on the output metric.

Dependence plot showing interactions between site_add_to_cart and item_price features and their impact on conversion rate
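A sketch of producing such a dependence plot, assuming the feature matrix contains columns named "site_add_to_cart" and "item_price" as in the plot above (the synthetic frame from the earlier snippet does):

```python
# SHAP value of one feature plotted against its raw value, colored by a
# second feature to expose the interaction (here: item price).
shap.dependence_plot(
    "site_add_to_cart",
    shap_values,
    X_train,
    interaction_index="item_price",
)
```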

About the team — Search & Display Marketing (SDM) Engineering and Science @ Walmart Labs is in charge of optimizing paid and free search for walmart.com. We are a highly motivated group of Big Data Geeks, Data Scientists and Applications Engineers, working in small agile groups to solve sophisticated and high-impact problems. We are building smart data systems that ingest, model and analyze massive flows of data from online and offline user activity. Underneath it all, we use cutting-edge machine learning, data mining and optimization algorithms to analyze this data on top of Hadoop and Spark.
