How important are you, Mr. Feature?

Roma Jain
EdTech in-depth | iSchoolConnect
4 min read · Jun 8, 2020

Most of us need to interpret our models, not just train them and leave them sitting in memory. And the most widely asked question is ‘How do I check my feature’s importance?’

Let’s take a real-life example. When we study for our exams, we have to go through multiple chapters in a subject. Some chapters are crucial (asked 70–80% of the time) whereas some are the least important of all. How do we know that? Weights!! We see the weight of a chapter by looking at previous exam papers. Now, why don’t we study all the chapters? Simply because it adds noise to our learning process. What if there is a chapter that is a mix of two chapters you have already gone through? Would you read that? No! We have already read through those two :P

Today, we will go deeper into how the importance of a feature is calculated and how the model’s accuracy metric gets improved.

If we randomly permute a feature’s values and evaluate on unseen data, the difference between the baseline metric and the metric after permutation, running predictions with the same trained model, tells us how important that feature is as part of the input. If the difference is near zero, the model performs the same whether the feature holds its actual values or randomly permuted ones. This means the model never really relied on this feature for its predictions.

Writing it from scratch will look like this:
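Here is a minimal from-scratch sketch of permutation importance. The dataset, model, and helper name are my own assumptions for illustration, not necessarily the original snippet:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def permutation_importance(model, X_valid, y_valid, metric):
    """Importance of each column = baseline metric minus the metric
    after shuffling that column on the (unseen) validation set."""
    baseline = metric(y_valid, model.predict(X_valid))
    rng = np.random.default_rng(42)
    importances = []
    for col in range(X_valid.shape[1]):
        X_shuffled = X_valid.copy()
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        score = metric(y_valid, model.predict(X_shuffled))
        importances.append(baseline - score)  # near zero => feature unused
    return importances

# Toy data standing in for the article's 6-feature example.
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=4, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

importances = permutation_importance(model, X_valid, y_valid, accuracy_score)
print(importances)
```

Note that the model is trained only once; each feature is scored by re-predicting on a shuffled copy of the validation set.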

Alternatively, if we drop a column, retrain the same kind of model on the reduced feature set, and test on the same unseen data, the change in the metric gives us a similar signal: dropping a really important feature makes the goodness-of-fit metric drop drastically. Whenever training happens, we set the same random state, so we know that any change in model performance is due to the change in the feature list only.
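This drop-column variant can be sketched as follows (again a toy setup of my own, with a fixed `random_state` on every retraining run as the text describes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=4, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Baseline: model trained on all features with a fixed random state.
baseline = accuracy_score(
    y_valid,
    RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_valid),
)

drop_importances = []
for col in range(X_train.shape[1]):
    # Retrain with the same random_state but without column `col`,
    # so any metric change comes from the feature list alone.
    X_tr = np.delete(X_train, col, axis=1)
    X_va = np.delete(X_valid, col, axis=1)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_train)
    drop_importances.append(baseline - accuracy_score(y_valid, model.predict(X_va)))

print(drop_importances)
```

Unlike permutation importance, this retrains the model once per feature, so it is slower but measures what the model can do *without* the feature at all.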

But there is one caveat I would like to mention here, which most people overlook: these methods only work well if all the features are independent of each other, i.e., not correlated with each other. In most projects we have domain knowledge, so we can drop the features that are correlated with each other. In other cases, we can detect them using Spearman’s rank correlation. Now, why can’t we keep the highly correlated features?
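A quick sketch of flagging correlated pairs with Spearman’s rank correlation (the synthetic data and the 0.9 threshold are my own assumptions):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Make column 3 a near-duplicate of column 0.
X[:, 3] = X[:, 0] + rng.normal(scale=0.01, size=200)

# spearmanr on a 2-D array returns the pairwise rank-correlation matrix.
corr, _ = spearmanr(X)

# Flag feature pairs above a chosen threshold; we would drop one of each pair.
threshold = 0.9
pairs = [(i, j)
         for i in range(corr.shape[0])
         for j in range(i + 1, corr.shape[1])
         if abs(corr[i, j]) > threshold]
print(pairs)  # → [(0, 3)]
```

Rank correlation is preferred here over plain Pearson correlation because it also catches monotonic but non-linear relationships between features.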

Let’s change our data and duplicate one column. What happens internally is that, during RF model construction, node splitting will choose between the two equally important variables roughly 50–50. This means that when you calculate feature importance, the importance gets split roughly equally between them too. While this may not be an issue during feature selection, it does affect model interpretability. It can lead to the incorrect conclusion that one of the features is a strong predictor while the others in the same group are unimportant, when in fact they are very close in terms of their relationship with the target.

Consider the original data with 6 features and their importances:

Now, let’s duplicate the f6 feature as ‘f6_duplicate’ and recalculate the importances:

As you can see, the original importance of feature f6, 0.1, is roughly divided into 0.07 and 0.017.
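The splitting effect above can be reproduced on toy data (my own synthetic setup, not the article’s original dataset, so the exact numbers will differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# All six features informative, so f6 (index 5) carries real signal.
X, y = make_classification(n_samples=500, n_features=6, n_informative=6,
                           n_redundant=0, n_repeated=0, random_state=0)

# Append f6 again as 'f6_duplicate' (column index 6).
X_dup = np.hstack([X, X[:, 5:6]])

orig = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
dup = RandomForestClassifier(random_state=0).fit(X_dup, y).feature_importances_

print("f6 alone:        ", orig[5])
print("f6 + f6_duplicate:", dup[5], dup[6])
```

Because the two identical columns win splits roughly 50–50, f6’s impurity-based importance is shared between them, and each now looks about half as important as the feature really is.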

This article for iSchoolConnect is the first among many more that I hope to publish in the Machine Learning space. Follow this publication for more :)
