Calculating XGBoost Feature Importance

Emily K Marsh
8 min read · Jan 31, 2023

Exploring Three Different Feature Importance Methods

As a Flatiron data science student, every project is an opportunity to deepen my understanding of the concepts I am learning about. I also find that my learning is enhanced by answering questions I actually want to know the answer to. This is the path that led me to explore the Data Science concept of calculating Feature Importance more in-depth.

My phase three project was building an XGBoost model to predict the level of adaptivity that students had with online learning. There were fourteen independent features collected about each student, and each student assigned themselves an adaptivity level of either low, moderate, or high. However, I didn’t want to just build a model to predict the adaptivity level of a student. I wanted to be able to use the model to see what features had the greatest effect on a student having a low or moderate adaptivity level versus a high adaptivity level. For me, coming from a career in online education before becoming a data science student, this was the question that meant the most to me. How could I have helped my past students do better with online learning? Sometimes the technical complexity can get in the way of the real-world need that the project is trying to fulfill. In order to give a recommendation on how to help students, what Data Science tool would I need to use? That tool was feature importance.

Naively, I assumed that there was only one way to approach feature importance. Spoiler: there is not. Data Science is a science at the end of the day, and science does not deal with absolutes or right answers. After doing research on different methods, I selected three to try: XGBoost’s built-in feature importance, permutation feature importance, and SHAP values. Ultimately, I decided to use the SHAP value method in my analysis because it was the only one that included feature importance broken down by the different categories of the dependent variable. But after completing the project I was left wondering if the SHAP value method was the best way to calculate feature importance. To answer this question, I did a deeper dive into each of the three ways that I calculated feature importance in order to understand whether my selection of SHAP values was appropriate for the project. For those who would like to acquaint themselves with the project beforehand, HERE is the link.

XGBoost Built-In Feature Importance Function

A benefit to using a gradient-boosted model is that after the boosted trees are constructed, it is relatively simple to retrieve the importance score for each feature. The feature importance is calculated for a single decision tree by the amount that each attribute’s split point improves the performance measure, weighted by the number of observations the node is responsible for. In other words, you are simply summing up how much splitting on each feature allowed you to reduce the impurity across all the splits in the tree. The feature importances are then averaged across all of the decision trees within the model.

As a built-in function, it is fairly simple to write code that graphs the feature importance of the model. The code below calculates the feature importance and displays it as a bar graph.

There are some drawbacks. The first is that the bar graph shows overall feature importance but gives no information about which category of the dependent variable each feature affects. More research would be needed to determine which category of the dependent variable these features are important to. After doing some research, it seems that breaking feature importance down by the categories of the dependent variable is not an easy task with this function. For this reason alone, SHAP values may be the better way to determine feature importance for recommendations.

It seems that the built-in feature importance function is the most helpful when trying to optimize the performance metrics of the model. However, in this situation, I was looking for feature importance to make recommendations to the client. As a result, it seems that the built-in feature importance function was not the ideal metric to use for my project.

Further research led me to a Medium article HERE written by Scott Lundberg. He goes into more detail about how the feature importance function built into XGBoost can be unreliable for a tree model. The function can calculate feature importance with three different metrics: gain, weight, and cover. To explore this, I ran the function with each of the three metrics, as shown below, to see whether they identified different features as important or stayed the same.

As can be seen, all three graphs provide three different answers to which features are the most important. This inconsistency across the three metrics is similar to what Scott Lundberg found in his exploration. It raises the question: which metric (gain, cover, weight) is the best for the dataset? None, was Scott Lundberg’s answer. Running an experiment, he found that only the SHAP value method was both consistent and accurate, as it is the only method that fairly allocates credit among the features using game theory. Using gain to calculate feature importance leads to a bias towards splits lower in the tree. In contrast, Lundberg says, the tree SHAP method is mathematically equivalent to averaging differences in predictions over all possible orderings of the features, rather than just the ordering specified by their position in the tree. For this reason, it seems that the built-in feature importance method will be less useful than the SHAP value method.

XGBoost Permutation-Based Feature Importance Method

Permutation-based feature importance is calculated by randomly shuffling each feature and computing the change in the model’s performance. The features that impact the performance the most are the most important ones. The code and the subsequently created graph are included below.

Similar to the built-in function, this feature importance method does not break out the feature importance by the categories of the dependent variable. So, again, it looks like SHAP values will give the most relevant information for a recommendation. However, I was curious as to why different features were rated as the most important with this method versus the built-in method. For the built-in feature importance method, the top three features were zero hours of instruction, the financial condition of rich, and one to five hours of instruction. In contrast, for the permutation-based feature importance method, the most important feature was boy, followed by government educational institution and then zero hours of instruction. How could these two methods disagree so much?

According to an article found HERE from mljar.com, the permutation-based method can have problems with highly correlated features. The one-hot encoded columns used to process the categorical features in this dataset will certainly be highly correlated with each other. By that logic, the permutation-based method seems to have produced questionable findings and is therefore not the right method to use for this project.

Feature Importance with SHAP Values Method

SHAP values are determined by using Shapley values from game theory to estimate how each feature contributes to the prediction. The SHAP value itself is the individual contribution of each feature to the output of the model for each example or observation. Compared to using coefficients as a measure of overall feature importance, SHAP values avoid the distortions and misinterpretations that arise when importance is tied to the scale of the variable itself. SHAP values can also account for the local importance of a feature. The code and the subsequently created graph are included below.

Finally, the problem of breaking down feature importance by dependent variable category is solved. By using SHAP values, it is easy to see in the graph which features are the most important for each level of adaptivity: ‘low’, ‘moderate’, and ‘high’. According to the SHAP values, the most important features are 4G network speed for wifi, the financial condition of rich, and zero hours of instruction time. The financial condition of rich and zero hours of instruction time agree with the built-in feature importance function’s findings. However, 4G network speed being the most important feature was a new finding. How did this feature score so much higher here than with either of the two previous methods?

After some research, a particularly helpful article was one written by Denis Vorotyntsev on Medium, found HERE. In this article, Denis runs an experiment to see which feature importance method is the most appropriate for a tree-based model. His experiment leads him to conclude that permutation importance should not be used because shuffling features forces the model to make predictions in unseen regions of the feature space, where it extrapolates badly.

As for the built-in feature importance method, as was mentioned earlier, this method tends to be biased toward giving importance to lower splits. Comparatively, SHAP is a much more accurate and consistent method of determining feature importance.

Conclusion

Ultimately, my deep dive into feature-importance methods led me right back to the method I had used blindly at the beginning. The SHAP method is more accurate and consistent for tree models than either the built-in or permutation method. It is also the only method that provided information about features important to each category of the dependent variable.

Although some may perceive this research as a long road to ending up right where I started, I have seen it as an opportunity to deepen my knowledge and have more confidence in my understanding of determining feature importance in machine learning.
