How to Understand Your Machine Learning Model

What should I use to understand the features and their importance to my model?

Matheus Pessoa
Analytics Vidhya



Machine learning models are complex to understand and evaluate. The features derived from the dataset and used to train a model are crucial to its performance.

Given the influence these models have on decision making, it is important to understand how each feature affects the model’s predictions. To do so, we are going to use techniques such as classical feature importance (available only for white-box ML algorithms) and partial dependence.

In this article, we are going to use the well-known scikit-learn library and algorithms such as RandomForest and DecisionTree to make predictions and evaluate the results. The data is the census income dataset from the UCI repository.

First things first. Let’s build a model and use it to analyze some aspects of the data.

Building the model

In this article, I am going to reuse some code from a notebook made by my professor; it can be found here. So, the first part is to build and tune a model. We will search for the right algorithm among three options: Decision Tree, Random Forest, and Bagging Classifier. All of them are available in the scikit-learn library.

Before I show the code, it’s important to explain what a bagging classifier is. Bagging is a technique that randomly subsamples a dataset with replacement. We create many subsets that can contain repeated data (because of the replacement) and use them to train different predictors. The most common estimator used with bagging is the Decision Tree, which easily overfits when used on its own. You may ask: “But if the decision tree overfits so easily, what guarantees that we will obtain the right answer on a different dataset?”. Well, that is exactly what bagging addresses. Overfitting means that the model has high variance (it cannot predict well on new data) and low bias (the predictor is quite sure that its answers on the training data are right).

When we aggregate the predictions with the statistical mode (a majority vote), we smooth out the variance. By taking the most frequent prediction, the models that captured the real pattern tend to agree and get it right, while the models with high variance on new data disagree with each other and are outvoted.
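The aggregation step above can be illustrated with a toy sketch (the three “predictors” here are just hard-coded label arrays, not trained models):

```python
# Toy illustration of aggregating by mode: three predictors disagree on
# some samples, and the majority vote smooths the disagreement out.
import numpy as np
from scipy import stats

preds = np.array([
    [1, 0, 1, 1, 0],   # predictions from predictor 1
    [1, 0, 0, 1, 0],   # predictions from predictor 2
    [1, 1, 1, 1, 0],   # predictions from predictor 3
])

# mode along axis 0 takes the most frequent label per sample
majority = stats.mode(preds, axis=0, keepdims=False).mode
print(majority)  # → [1 0 1 1 0]
```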

Bagging is a really clever technique for avoiding overfitting; a paper that explains it in more depth can be found here. Now, let’s look at the code to select the best model. I’m showing only part of it; the complete code is in my professor’s GitHub repository.
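The original selection code isn’t reproduced here, but a minimal sketch of the idea looks like this (parameter grids and the synthetic data are my assumptions, not the professor’s exact setup):

```python
# Sketch of the model search: compare Decision Tree, Random Forest and
# Bagging with GridSearchCV and keep the best cross-validated model.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the census income dataset
X, y = make_classification(n_samples=500, n_features=9, random_state=42)

candidates = {
    "decision_tree": (DecisionTreeClassifier(random_state=42),
                      {"max_depth": [3, 5, None]}),
    "random_forest": (RandomForestClassifier(random_state=42),
                      {"n_estimators": [50, 100]}),
    "bagging": (BaggingClassifier(bootstrap=True, random_state=42),
                {"n_estimators": [50, 100]}),
}

best_name, best_score, best_model = None, -1.0, None
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3).fit(X, y)
    if search.best_score_ > best_score:
        best_name = name
        best_score = search.best_score_
        best_model = search.best_estimator_

print(best_name, round(best_score, 3))
```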

It took almost 40 minutes to search for the best model on Google Colab. After this process, the result is presented in the figure below.

Best model and params

The BaggingClassifier was the best model found. I forced bootstrap sampling with replacement in its configuration. This first part just shows what bagging is and how it can improve classification and avoid overfitting. In the second part of the article, we will build a model that uses bagging and evaluate the training results.

Feature Importance

One of the important aspects of machine learning is feature selection. Understanding which features are relevant and how each of them impacts the model’s decisions helps reduce model complexity through feature reduction, avoids biased models, and makes the results easier to interpret, increasing confidence in the model’s predictions.

Here, we are going to compare the classical feature importance provided by scikit-learn’s ensemble algorithms with partial dependence, local interpretation, and SHAP values. The classifier used here is the Random Forest. We chose it because it supports bagging and comes with its own feature importance evaluation.

First, let’s build the model.

I reused the pipeline created before to convert the data to the right format. In “train_test_split”, the stratify parameter keeps the class proportions of the dataset in both splits. The predictor is a RandomForest optimized with GridSearchCV. I chose this model instead of the Bagging classifier shown in the first part because RandomForest has a property that exposes the feature importances.
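A minimal sketch of that training setup (the preprocessing pipeline and the exact grid are assumptions, with synthetic data standing in for the census dataset):

```python
# Stratified split plus a RandomForest tuned with GridSearchCV;
# oob_score=True keeps the out-of-bag estimate for the evaluation step.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_features=9, random_state=0)

# stratify=y keeps the class proportions equal in train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(oob_score=True, bootstrap=True, random_state=0),
    {"n_estimators": [100, 200], "max_depth": [5, None]},
    cv=3,
).fit(X_train, y_train)

clf = search.best_estimator_
```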

After training the model, let’s evaluate it using the oob_score and the test data, and check whether both results are similar.
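That comparison can be sketched as follows (again on synthetic data, with an untuned RandomForest for brevity):

```python
# Compare the out-of-bag estimate with the held-out test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, oob_score=True,
                             random_state=0).fit(X_train, y_train)

print(f"oob validation: {clf.oob_score_:.3f}")
print(f"test accuracy:  {clf.score(X_test, y_test):.3f}")
```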

Evaluation of the model

In the image above, we can see the complete evaluation of our model. The first line is the validation with the out-of-bag samples (the ones that weren’t chosen when the dataset was subsampled). The second output is the accuracy of the model on the test set. As expected, the oob validation is close to the test evaluation, which makes oob a good choice for validation.

With the model ready, we can start analyzing the importance of the features. First, let’s see what scikit-learn gives us with its standard analysis.
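Reading the built-in importances is a one-liner on the trained forest; a sketch with placeholder feature names instead of the census columns:

```python
# scikit-learn's feature_importances_ (mean impurity decrease), ranked.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=9, random_state=0)
names = np.array([f"feature_{i}" for i in range(X.shape[1])])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# sort features from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
for name, imp in zip(names[order], clf.feature_importances_[order]):
    print(f"{name}: {imp:.3f}")
```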

Feature importance

The graph above shows the importance of all nine features used by the model. The four most important are “marital_status”, “capital_loss”, “capital_gain” and “workclass”. These importances are calculated as the average gain a feature produces when it is used to split nodes in the trees.

Let’s compare the default result with the ELI5 evaluation. To do that, we just need to call the function “show_weights” with the classifier and the column names as parameters.

Features weights

The result obtained with ELI5, shown above, is the same as the default given by scikit-learn. It means (and the documentation confirms) that ELI5 uses the “gain” as the default method to compute the importance.

Another excellent thing about ELI5 is that we can evaluate each feature’s weight in an individual decision. Below, I show the weight of each feature in a value predicted by the model.

Explanation of prediction

ELI5 is a great framework for evaluating models and understanding feature importance, but it is limited to linear models and tree-based models. Another limitation is that ELI5 doesn’t offer partial dependence analysis. To overcome that, let’s use the Skater framework.

Skater is a model-agnostic framework that offers a great set of tools to analyze feature importances of black-box models. Here, we will explore the partial dependence plots of our most important features.

First, let’s plot the rank of the features.

The rank of the features
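Skater’s Interpretation API produced the rank above, but Skater can be hard to install on recent Python versions. As a stand-in, scikit-learn’s permutation_importance computes the same kind of model-agnostic ranking (a sketch with placeholder feature names):

```python
# Model-agnostic feature ranking via permutation importance: shuffle one
# feature at a time and measure how much the score drops.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=600, n_features=9, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)

# print features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```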

The plot is similar to what ELI5 computed: the most important features are the same. Now we can plot the partial dependence of some features and understand their behavior. Partial dependence is a technique that shows the impact of one feature on the model’s predictions while the other features are held constant. It can show whether the relationship between the target and the feature is linear, monotonic or more complex.

The most important feature is the “marital_status”.

Marital status partial plot

In the figure above, according to the model, the marital statuses most frequently associated with income greater than $50K are divorced (0) and separated (5).

The second most important feature is capital gain. The figure below presents the normalized gains on the X-axis; values greater than $7K indicate a higher chance of a high income.

Capital gains partial plot

The last partial plot shows the importance of “workclass”. It seems that those who work for the government, as federal (1) or local government (2) employees, have a higher probability of a high income.

Workclass importance

Conclusion

This article is part of the evaluation for my machine learning class. Here I tried to cover the basics of techniques to avoid overfitting, how to analyze a model, and how to understand the features of a dataset and how each of them impacts the model’s predictions. I’m open to comments and critiques about what I could do better.

Thanks.
