Improve business decisions with three machine learning interpretability tools
Machine learning interpretability is an active research topic, and new papers and articles on the subject are published every week. This article presents three complementary tools that cover most of our interpretability needs at ManoMano, where we help more than one million Europeans every day find what they want in our catalogue of 3 million DIY and gardening products:
- Feature importance: which features does the model use the most?
- Partial dependence plot: how does the model use a specific variable?
- Feature contribution: why did the model make this specific prediction?
Why should we interpret our machine learning models?
- Understand your business: Which factors drive the website conversion rate? What is the impact of the shipping price on it?
- Create trust with business counterparts: Why is this sales forecasting prediction so high? Because the product was sold 157 times last week and seasonality for its category is growing.
- Take relevant actions to solve a problem: Why is this customer likely to churn? Because he had quality issues last month.
- Debug/improve the model: Why is this sales forecasting prediction so high? Missing values in the production dataset. Why did we underpredict so much? Sales of this product were low because it was out-of-stock and we had not noticed.
Use case: conversion rate modeling using Gradient Boosting Trees
To illustrate the three tools, we focus on modeling the product conversion rate. Here’s what our training dataset looks like:
We are trying to predict the conversion rate of a product on a specific day, depending on product-based features (price, ratings, shipping time, etc.) and the day of the week. To simplify the following analysis, we artificially sample our dataset to have an average conversion rate of 10%.
We’re using LightGBM, a gradient boosting framework based on decision trees that is widely used in data science competitions.
Feature importance
Feature importance is a tool to compute and quickly visualize how useful each feature is to our model. It is commonly used with tree ensembles such as Random Forest or Gradient Boosting Trees: the more a feature is used to make key decisions in the decision trees, the higher its relative importance. For more details on how feature importance is calculated, you can refer to this blog post. Let’s visualize the feature importance plot for our predictive model:
According to this bar chart, the most important features to predict conversion rates are the product price, the shipping price, the ratings and the shipping time. This matches our business intuition, which is reassuring.
Influence of missing values
Let’s try another setup. Imagine we have a data quality problem and, for some reason, 90% of the price values are missing. After retraining the model, here is what we get:
From this plot alone, we could conclude that the price is not an important feature, which is totally wrong. We should rather spend time improving data quality to improve our model. By the way, missing values can also carry information. For example, a missing average rating simply means that there are no ratings, and that the product is either new or not popular.
Influence of correlated variables
Let’s add three correlated (and noisy) price features and see what happens:
Here the importance of our feature “price” decreased from 38% to 24%: part of its importance was redistributed among the correlated variables. Moreover, adding noisy features increases memory and CPU usage as well as the risk of overfitting. Therefore, a feature that this tool ranks as important is not necessarily a feature the model needs.
Pros and cons of feature importance
- (+) Very simple to implement, it’s a few lines of code
- (+) Quickly gives good insights on the signal
- (+) Efficient tool to detect data quality problems in the training dataset
- (-) Sensitive to feature correlations
- (-) Sensitive to missing values
- (-) Does not provide the relation between the feature and the target function (in our example, we know that the price matters a lot for the prediction, but we don’t know in which direction)
Partial dependence plots
Partial dependence plots visualize the impact of a feature on the predicted target, marginalizing over the values of all other features. Such a tool helps to understand the relationship between the target and a feature, all other things being equal. For the complete mathematical derivation, you can refer to this chapter of The Elements of Statistical Learning. Partial dependence plots can be used with any machine learning model. We recommend using the very complete PDPBox Python package.
To illustrate partial dependence, let’s take a real-life example related to our use case. Assume the business owners want to know the impact of shipping prices on the conversion rate. Knowing that the shipping price is strongly correlated with the product price, we start by computing the feature shipping_ratio, which is the ratio between the product shipping price and its total price:
Let’s make a univariate analysis by computing the average conversion rate per bin of shipping_ratio:
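A minimal sketch of that computation follows. It assumes that the total price is the product price plus the shipping price, which is one reading of the definition above; the function name and its handling of a zero total are hypothetical.

```python
def shipping_ratio(product_price: float, shipping_price: float) -> float:
    """Share of the total price (product + shipping) that is due to shipping.

    Assumption: total price = product price + shipping price.
    """
    total_price = product_price + shipping_price
    if total_price <= 0:
        raise ValueError("total price must be positive")
    return shipping_price / total_price

# A 34.2€ product shipped for 6.9€: shipping is roughly 17% of the total.
print(round(shipping_ratio(34.2, 6.9), 3))
# Free shipping gives a ratio of 0.
print(shipping_ratio(167.99, 0.0))
```

Being a ratio, the feature is bounded between 0 and 1 whatever the absolute price level, which is what makes it less tied to the product price than the raw shipping price.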
The resulting graph is pretty clear: the conversion rate is positively correlated with the shipping_ratio feature! Should we go to the business owners and advise them to increase the shipping prices of all our products in order to boost the conversion rate? Of course not, because correlation doesn’t imply causation. Let’s make a partial dependence analysis on the same variable and observe the difference:
As expected, once we take into account all the other features used by the predictive model, the shipping_ratio feature is negatively correlated with the predicted conversion rate. Note that this still does not show real causation, only a correlation corrected for the other features.
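The difference between the two analyses can be reproduced with a toy example. The sketch below is not PDPBox and not our model: the predict function, its coefficients and the three products are invented so that cheap products have a high shipping ratio, and the partial dependence is computed by hand by forcing the feature to each grid value and averaging the predictions.

```python
from statistics import mean

def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value of `grid`,
    all other feature values being kept as observed."""
    curve = []
    for value in grid:
        forced_rows = [{**row, feature: value} for row in rows]
        curve.append(mean(predict(row) for row in forced_rows))
    return curve

def toy_predict(row):
    # Hypothetical stand-in for a trained model: conversion drops with
    # the price AND with the shipping ratio.
    return 0.30 - 0.001 * row["price"] - 0.10 * row["shipping_ratio"]

# Toy catalogue in which cheap products have a high shipping ratio.
rows = [
    {"price": 10.0, "shipping_ratio": 0.40},
    {"price": 50.0, "shipping_ratio": 0.15},
    {"price": 200.0, "shipping_ratio": 0.03},
]

# Univariate view: conversion of each product, ordered by shipping ratio
# (the noise-free toy predictions stand in for observed conversion rates).
by_ratio = sorted((row["shipping_ratio"], toy_predict(row)) for row in rows)
univariate = [rate for _, rate in by_ratio]

# Partial dependence view: same feature, other features averaged out.
pdp = partial_dependence(toy_predict, rows, "shipping_ratio", [0.0, 0.2, 0.4])

print("univariate:", [round(v, 3) for v in univariate])
print("partial dependence:", [round(v, 3) for v in pdp])
```

In this toy setup the univariate series increases with the shipping ratio only because high-ratio products are also the cheap ones, while the partial dependence curve decreases.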
Pros and cons of partial dependence plots
- (+) Shows the relation between a feature and our variable of interest
- (+) Shows the influence of a feature once the effect of the other features has been averaged out, unlike a standard univariate analysis
- (-) Time-consuming on large datasets
- (-) Limited to two-dimensional plots
- (-) Sensitive to feature correlations
Feature contribution
Feature contribution computes the impact of each feature on a given prediction. It gives a micro understanding of each prediction. Like partial dependence plots, feature contribution can be computed regardless of the machine learning model used. You can refer to this blog post if you want to know how it is calculated for Random Forests. LightGBM’s predict function provides a parameter (pred_contrib) to compute them directly. Let’s predict one of the most popular products at ManoMano, a drill from Makita, and observe the contribution of each feature:
This product has a lot of good ratings (644 ratings, with a 4.69/5 average value). Therefore the contribution of the number of ratings to the predicted conversion rate is +12%. However, its price (167.99€) is above the average, and expensive products tend to have a lower conversion rate. Therefore its contribution to the predicted conversion rate is -7.5%. Note that the sum of the contributions, once the intercept is added, is equal to the predicted conversion rate.
We can repeat this process for another product and observe the differences:
Even though this product has a predicted conversion rate similar to that of the Makita drill, it has another “profile”: it is much cheaper (34.2€) and well rated, but its shipping time and price are dissuasive (6.9€ for shipping within 8 days).
Like the two previous methods, feature contribution is sensitive to feature correlations: if you feed the model with two very correlated features, the contribution will be artificially split between them.
Pros and cons of feature contribution
- (+) Micro explanation of a prediction as a sum of feature contributions
- (+) Helps to investigate predictions and to detect model bugs or data quality problems early
- (+) Builds trust with business owners by explaining to them how the algorithm behaves
- (+) Allows prescriptive modeling (vs predictive modeling): explaining why a customer might churn helps to take the right actions
- (-) Sensitive to feature correlations
Conclusion
We have presented three tools that cover most of our interpretability needs at ManoMano. Some warnings if you use them out of the box:
- Highly correlated features are a nuisance for machine learning interpretability. You need to get rid of them before trying to interpret your model.
- Trying to interpret models with very poor predictive performance (e.g. 0.51 AUC) does not make sense. The usefulness of an interpretation is directly linked to the predictive signal captured by the model.
We hope this article will be useful for you and that interpretability concerns will no longer be a problem when using powerful machine learning techniques!
Written by Jacques Peeters and Romain Ayres.
Acknowledgements
Alexandre Cazé, Yohan Grember, Chloé Martinot, Marin De Beauchamp, Bryce Tichit, Raphaël Siméon, Thomas Charuel, Louis Pery, Cyril Auberger, Matthieu Cornec and all our great colleagues at ManoMano.