Feature Importance in the Age of Explainable AI/ML

George Vyshnya
Published in SBC Group Blog
7 min read · Jul 12, 2020

“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.” — Eliezer Yudkowsky

Explain or Not to Explain? (Photo courtesy of Andrea Piacquadio from Pexels)

Why Explain AI/ML?

In recent years, we have seen an increasing demand for interpretable AI/ML. Human decision-makers would like to trust their AI-based decision support on the grounds of a rationale rather than a quasi-religious belief in whatever the AI system calculates, suggests, or forecasts.

It was quite easy in the good old days, when the stage was occupied by easily interpretable ML algorithms (linear regression, polynomial regression, logistic regression, CART-based decision trees, etc.).

However, such algorithms lacked accuracy in many real-world business forecasting and decision support scenarios. That led to the advent of highly accurate but complicated algorithms, starting from Random Forests, through Gradient Boosting Machine-like algorithms, up to the wide spectrum of second-generation Neural Networks. The accuracy came at a price, though: there was no longer an easy way to interpret the decision-making flow of such AI/ML algorithms in a human-friendly, rational way.

One of the early attempts to address the challenge was to add a supplementary capability to calculate feature importance scores to some of the modern algorithms (Random Forest, GBM, and lightgbm all feature it, for instance).

However, such feature importance scores are sometimes confusing or misleading, because they are calculated separately from the ML model training itself.

This collision gave birth to several analytical algorithms that calculate feature importance and perform feature selection for ML models. As opposed to the classical statistical (filtering) approaches, where feature importance is determined on the basis of a certain statistical metric (be it a Pearson correlation or a chi-square test), such techniques embrace a series of model training experiments under certain feature space tweaks. In such a way, they score the relative importance of each feature for the specific model to be trained.

In this post, we are going to review such feature selection / feature importance detection methods. All of them are useful in the effort to build an industrial culture of interpretable AI/ML.

In this direction, we are on a par with industry giants like Google (which recently launched its Explainable AI services; see https://cloud.google.com/explainable-ai).

Note: in addition to building comprehensive interpretable ML models, relevant feature selection will also help you handle other ML challenges, as follows

  • Dropping the garbage (non-informative) features from your modelling pipeline
  • Tackling the curse of dimensionality as well as minimizing the impact of the model overfitting

What’s In Your Toolbox?

Photo by Andrea Piacquadio from Pexels

In real machine-learning problems, tackling the curse of dimensionality as well as increasing model interpretability translates into solid feature selection techniques.

Apart from the filtering-based methods (correlation- or chi-square-based) or PCA (the latter is mostly applicable to linear regression problems), there is a set of analytical (computational) feature selection methods.

In the subsections below, we will review the available analytical feature selection options, along with the technical implementation details for them in Python.

Wrapper-based methods

Wrapper-based methods treat the selection of a set of features as a search problem.

RFE and its implementation in scikit-learn can be referred to as one of the good options to go with here, for example.

A utility function to benefit from RFE feature importance score calculations is provided below.
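A minimal sketch of such a utility, assuming the predictors X come as a pandas DataFrame; the function name, the Random Forest default estimator, and the top-50 default are illustrative choices rather than the article's exact code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE


def rfe_feature_ranking(X, y, n_features_to_select=50, estimator=None):
    """Rank features with Recursive Feature Elimination (RFE).

    X is assumed to be a pandas DataFrame of predictors; returns one row
    per feature with its RFE ranking (1 = kept in the final subset).
    """
    if estimator is None:
        # Any scikit-learn estimator exposing coef_ or feature_importances_ works
        estimator = RandomForestRegressor(n_estimators=100, random_state=42)
    selector = RFE(estimator, n_features_to_select=n_features_to_select, step=1)
    selector.fit(X, y)
    return (
        pd.DataFrame(
            {
                "feature": X.columns,
                "ranking": selector.ranking_,
                "selected": selector.support_,
            }
        )
        .sort_values("ranking")
        .reset_index(drop=True)
    )
```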

Other options in the wrapper-based feature selection domain are Backward Elimination, Forward Selection, and Bidirectional Elimination.

Embedded methods

Embedded methods are applicable to ML algorithms that have feature selection built in; Lasso, Random Forest, and lightgbm, for instance, all have it.

From the technical standpoint, feature selection with the embedded methods relies on scikit-learn’s SelectFromModel. You can see it in action with the demo code snippet below
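A minimal demo sketch, assuming a Lasso selector on synthetic data; the alpha value and the dataset parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Toy regression data: 100 features, only 10 of them actually informative
X, y = make_regression(n_samples=500, n_features=100, n_informative=10, random_state=42)

# Lasso shrinks the coefficients of uninformative features towards zero;
# SelectFromModel keeps only the features whose coefficients survive
selector = SelectFromModel(Lasso(alpha=1.0, max_iter=10_000))
X_selected = selector.fit_transform(X, y)

print(f"Kept {X_selected.shape[1]} of {X.shape[1]} features")
print("Selected feature indices:", np.where(selector.get_support())[0])
```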

Permutation method

Put simply, this method shuffles (permutes) the data in a column and then tests how much that affects the model accuracy. If shuffling the data in a column drops the accuracy, then that column is assumed to be important.

You can benefit from the out-of-the-box scikit-learn utility that facilitates permutation feature importance calculation. I wrapped it up in a utility function below.
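A sketch of such a wrapper around scikit-learn's permutation_importance (available in sklearn.inspection since version 0.22); the function name and defaults are illustrative, and X_val is assumed to be a pandas DataFrame:

```python
import pandas as pd
from sklearn.inspection import permutation_importance


def permutation_feature_importance(model, X_val, y_val, n_repeats=10, random_state=42):
    """Score features by how much shuffling each column degrades the model score.

    The model must already be fitted; X_val / y_val should be held-out data,
    with X_val as a pandas DataFrame so the feature names can be reported.
    """
    result = permutation_importance(
        model, X_val, y_val, n_repeats=n_repeats, random_state=random_state
    )
    return (
        pd.DataFrame(
            {
                "feature": X_val.columns,
                "importance_mean": result.importances_mean,
                "importance_std": result.importances_std,
            }
        )
        .sort_values("importance_mean", ascending=False)
        .reset_index(drop=True)
    )
```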

Drop-Column Method

This method measures the performance of an ML model on the full-feature dataset of predictors vs. a set of smaller datasets, each of which drops exactly one feature from the original fully featured dataset. In such a way, the difference between the model's performance on the original dataset and on every dataset with a dropped feature becomes a metric of the dropped feature's importance for the model performance.

I could not find a stable Pythonic implementation of such a feature selection / feature importance measurement, so I ended up with the custom implementation below.
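A sketch of a custom drop-column implementation, assuming a pandas DataFrame of predictors and a scikit-learn-compatible regressor; the function name and the RMSE scoring metric are illustrative choices, not the article's exact code:

```python
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import cross_val_score


def drop_column_importance(model, X, y, cv=5, scoring="neg_root_mean_squared_error"):
    """Drop-column feature importance.

    Scores the model on the full feature set, then re-scores it once per
    feature with that feature removed; the importance of a feature is the
    drop in the cross-validated score caused by removing it.
    """
    baseline = cross_val_score(clone(model), X, y, cv=cv, scoring=scoring).mean()
    importances = {}
    for col in X.columns:
        reduced_score = cross_val_score(
            clone(model), X.drop(columns=[col]), y, cv=cv, scoring=scoring
        ).mean()
        # Positive value => removing the feature hurt the model, i.e. it matters
        importances[col] = baseline - reduced_score
    return pd.Series(importances).sort_values(ascending=False)
```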

Several Practical Considerations

When deciding which feature selection method is the best one for your specific problem, you should keep in mind the points below

  • there is no single silver-bullet method of feature selection that works well for each and every project; typically, you will have to undertake several feature selection experiments (using different methods) to figure out which one leads to the ML model with the highest performance metric score
  • all methods except the filter-based ones carry a high computational cost, and it may take too much time to run them properly on large datasets
  • embedded methods sometimes introduce confusion (or even a misinterpretation of the feature importance), as the feature importance scores are often calculated separately from the model training

Getting Your Hands Dirty: Selecting Features for Advanced House Pricing

We will illustrate the importance of feature importance scoring and appropriate feature selection in real-world ML projects with the show-case example of the Advanced House Price Prediction problem.

You can review its code fully in the notebook.

The major ML experiment setup/plan implemented in the notebook is as follows (a condensed code sketch follows the list)

  • do the feature engineering and data preprocessing as described here
  • calculate feature importance scores with different analytical methods (model-calculated feature importance coefficients, embedded feature selection from the model, wrapper-based feature importance scoring with RFE, permutation feature importance scoring, and drop-column feature importance scoring)
  • train a bunch of Random Forest models (both using the full set of predictors in the training set and using the top features selected by each of the feature importance scoring algorithms above)
  • concatenate the model training/testing scores into a single Pandas dataframe
  • compare the performance of the trained models and choose the one with the lowest CV test score (which is in fact the average RMSE on the hold-out folds across the n-fold CV rounds)
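A condensed sketch of that plan, assuming the per-method lists of selected feature names have already been computed; the function name, the feature-subset dictionary, and all parameters here are illustrative rather than the notebook's exact code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def compare_feature_subsets(feature_subsets, X_train, y_train, n_trees=800, cv=5):
    """Train one Random Forest per feature subset and collect the CV RMSE scores.

    `feature_subsets` maps a method name (e.g. "rfe_top50") to the list of
    column names selected by that method; "all_features" can simply map to
    the full list of columns of X_train.
    """
    rows = []
    for method, features in feature_subsets.items():
        model = RandomForestRegressor(n_estimators=n_trees, random_state=42, n_jobs=-1)
        # Negative RMSE on the hold-out folds of the n-fold CV rounds
        scores = cross_val_score(
            model,
            X_train[features],
            y_train,
            cv=cv,
            scoring="neg_root_mean_squared_error",
        )
        rows.append({"method": method, "cv_rmse": -scores.mean()})
    # The best-performing model is the one with the lowest CV RMSE
    return pd.DataFrame(rows).sort_values("cv_rmse").reset_index(drop=True)
```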

Results Discussion

We will find that

  • almost every model we trained delivered comparable performance (except for the set of models that used the tiny subset of features selected via the embedded feature selection method)
  • the feature importance scores calculated by the RF algorithm directly are really misleading, and they do not reflect the actual feature importance for the trained model
  • the reason why embedded feature selection led to a poor result (see the chart above) is that RF needs a reasonable variance in the feature space to train a diverse set of weak predictors (the individual decision-tree estimators) that represents the complexity of the regression problem properly
  • among the bunch of models we trained, the best performance on the CV testing sets (with a testing score of 0.127509) was demonstrated by the model that used the top 50 RFE features and n_trees = 800
  • in this case, the appropriate feature selection not only improved the interpretability of our model but also gave it an edge in forecasting performance (vs. the set of RF models that used the entire set of features in training)

Summary

“What is vital is to make anything about AI explainable, fair, secure and with lineage, meaning that anyone could very simply see how any application of AI developed and why.”

Ginni Rometty, IBM’s CEO, made that statement during her keynote address at CES on January 9, 2019. The background of asserting the need for explainable AI is that keeping it as a sealed black box makes it impossible to check and fix biases or other problems in the programming.

As we can conclude from the case study above, solid feature importance scoring with analytical methods (as opposed to statistical and embedded methods) can yield solid intelligence on which features are really important for your ML model. If applied in a wise manner, these methods will lead both to explainable ML solutions and to better performance of your ML models.

References

You can refer to the blog posts below if you would like to undertake a deeper dive into industrial feature selection / feature importance calculation techniques.
