Feature Engineering — How to Find Feature Importance Scores (Growth Hack)

Or: Feature Importance Scores on the Iris Dataset

Rohit Madan
Analytics Vidhya
3 min read · Nov 19, 2019


Feature engineering is a widely misinterpreted term. When I first started building models, I was confused about it too, and used to think it meant an approach, a bunch of practices, or a set of tools for deciding which features to build and which features to skip.

Confused?

But as I worked with it more and started handling real-world data, I realised that deciding which features to pick, and then doing feature engineering on them, is much more of a “decide per dataset” kind of problem than just scaling the feature set or running some functions on it.

Sometimes you combine two features to make a new one, like Rooms per household and Households per neighbourhood becoming Rooms per neighbourhood; sometimes you subtract one feature from another; and other times you pick only the best 8 out of 30 because the other 22 have less than a 0.01% impact on your decision making.
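For instance, here is that first combination as a quick pandas sketch (the column names and numbers are hypothetical, just for illustration):

```python
import pandas as pd

# Hypothetical housing data; values are made up for illustration
housing = pd.DataFrame({
    "rooms_per_household": [5.0, 6.2, 4.1],
    "households_per_neighbourhood": [320, 410, 275],
})

# Combine two existing features into a new, more useful one
housing["rooms_per_neighbourhood"] = (
    housing["rooms_per_household"] * housing["households_per_neighbourhood"]
)
```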

Some people call it more art than science and I think they are right.

Feature engineering is a diverse and much bigger topic to cover, but there is a small and neat trick I found while working on some models, and it has proven extremely useful for the models I was trying to build.

Here’s my trick

Whenever I train a classifier, I usually prefer Random Forest as the first classifier of choice, not only because it is transparent and can show me how the machine is breaking down my model (covered below),

but also because I can check the feature importance scores of my entire model. I can build a couple of candidate models and then compare their importance scores to find out what the machine is actually digging up.

For example, when I computed feature importance scores on a housing dataset, the scores told me that Median income and Garden area were the most important features, while Neighbourhood was the weakest feature class.

How did I do that?

Let me show you.

After you have created your random forest classifier or regressor, i.e.
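A minimal sketch with scikit-learn (the hyperparameters here are illustrative, not required):

```python
from sklearn.ensemble import RandomForestClassifier

# 500 trees, using all CPU cores; a fixed seed keeps runs reproducible
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
```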

and you have fitted it on your training data,
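Something like the following, where X_train and y_train stand in for whatever training split you are using:

```python
rnd_clf.fit(X_train, y_train)  # X_train: features, y_train: labels
```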

you can read off feature_importances_ like so.
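A sketch, assuming feature_names holds your column names in the same order as the columns of X_train:

```python
for name, score in zip(feature_names, rnd_clf.feature_importances_):
    print(name, score)
```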

For the Iris dataset, the feature importance scores from the code above come out with petal length and petal width dominating (each around 0.4 of the total), while sepal width contributes only a couple of percent. The exact numbers vary with the random seed.

If you want to run the code for yourself, just copy-paste the following into your notebook.
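A self-contained sketch using scikit-learn's bundled Iris loader (hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Load the Iris dataset that ships with scikit-learn
iris = load_iris()

# Train a random forest on all four features
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])

# Feature importance scores always sum to 1.0
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
```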

and voilà, you have a list of all the feature scores.

The trick is that the low-percentage features are not worth spending time on (cleaning, scaling, standardising), so you can either combine them to make a new feature and then find the feature importance scores again, or ignore them completely.

Once you have fewer features to deal with, you can get better model scores and build more scalable models overall.

You can do the same on images as well, for example on MNIST or similar datasets, where every pixel is a feature.
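A sketch of that idea (assuming the OpenML copy of MNIST and matplotlib for the plot); reshaping the 784 per-pixel importances back into a 28x28 image shows which pixels the forest actually relies on:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

# Download MNIST from OpenML (70,000 28x28 images, flattened to 784 features)
mnist = fetch_openml("mnist_784", version=1, as_frame=False)

# Training on the full dataset can take a little while
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rnd_clf.fit(mnist["data"], mnist["target"])

# One importance score per pixel; reshape back into image form
plt.imshow(rnd_clf.feature_importances_.reshape(28, 28), cmap="hot")
plt.axis("off")
plt.colorbar()
plt.show()
```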

If you’re looking for a feature engineering guide book, I highly recommend Aurélien Géron’s Hands-On Machine Learning, since this article is but a wine tasting compared to the entire topic.

Hope this hack helps you build quicker models with stronger confidence scores.

FIN.
