Feature Selection (Boruta / LightGBM / Chi-Square): Categorical Feature Selection

Indresh Bhattacharyya
5 min readSep 12, 2018



Feature selection is an important concept in the field of data science. Especially when it comes to real-life data, the data we get and the data we are going to model are often quite different.

Why Feature Selection?

This answer has been given by many people, but I will still give a brief, realistic answer. Think of it this way: say we have 90 columns in a raw dataset (something we got from a company).

Say we have a bank dataset and our ultimate goal is to predict whether a customer is going to churn or not. The columns may contain customer details plus something like products purchased by the customer at some mall (this does not usually happen, but for argument's sake think of it this way).

How is "products purchased by the customer at some mall" going to help us predict whether the customer is going to churn or not? It is not. But do we know that? How can we say that with certainty (if we are not well versed in this specific domain)? Feeding our model vague data would only decrease the proficiency of the model. That is where feature selection and feature importance come in.

NOTE: One thing everyone might be thinking is that we already know how to perform feature selection (using R², adjusted R², or some other methods), but one thing I found was that information on categorical feature selection is very limited. And that is the actual focus of this blog: CATEGORICAL FEATURE SELECTION.

Why not use the dummy-variable concept and do feature selection?

Here is why not.

  1. Say we have a column Country, and there are almost 150 unique categories in it. Making 149 dummy columns with a sparse matrix just to find the feature importance seems like a dumb idea (see the short sketch after this list).
  2. Then again, when each dummy column represents a specific country, an importance measure like R² may suggest removing a single country column completely, which again is a bad idea. Because instead of saying this feature is good or bad, we are actually saying this category in the feature is good or bad.
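To see the column explosion for yourself, here is a minimal sketch (the 'Country' column and its values are hypothetical, just for illustration):

import pandas as pd

# A tiny 'Country' column; imagine ~150 unique values in the real data
df = pd.DataFrame({'Country': ['India', 'USA', 'Germany', 'India', 'Brazil']})

# One-hot encoding creates one sparse 0/1 column per unique country
dummies = pd.get_dummies(df['Country'], prefix='Country')
print(dummies.shape)  # (5, 4) here; with 150 countries it becomes 150 columns (149 with drop_first=True)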

So The Solution:

We are going to use algorithms that treat a categorical feature as a category as a whole, instead of using the one-hot-encoding (OHE) concept.

Note: There are multiple methods of doing feature selection. Here we are mainly going to focus on wrapper methods. For all the other methods, check the following series.

Boruta:

The Boruta algorithm is a wrapper built around the random forest classification algorithm. It tries to capture all the important, interesting features you might have in your data set with respect to an outcome variable.

Methodology:

  1. First, it adds randomness by duplicating each feature and shuffling the values in each copy. These copies are called shadow features (a rough sketch of this step follows the list).
  2. Then it trains a classifier (a random forest) on the extended dataset and calculates feature importance using Mean Decrease Accuracy or Mean Decrease Impurity.
  3. Next, the algorithm checks, for each of your real features, whether it has higher importance than the best of the shadow features, i.e. whether its Z-score is higher than the maximum Z-score of the shadow features.
  4. At every iteration, the algorithm compares the Z-scores of the shuffled copies and the original features to see whether the originals perform better. If a feature consistently does, the algorithm marks it as important.
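Here is a rough sketch of the shadow-feature idea in step 1, assuming the features live in a pandas DataFrame X (this is only an illustration, not BorutaPy's actual internals):

import numpy as np
import pandas as pd

def add_shadow_features(X: pd.DataFrame) -> pd.DataFrame:
    # Shuffle every column independently so each copy loses any relation to the target
    shadows = X.apply(np.random.permutation)
    shadows.columns = ['shadow_' + c for c in X.columns]
    # The classifier is then trained on real + shadow features together
    return pd.concat([X, shadows], axis=1)

A real feature is only confirmed if its importance beats the best shadow feature over repeated iterations.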

For more information, read this blog by the author who created the Python package.

There is a Python implementation, BorutaPy.

Install using:

pip install Boruta

Algorithm:

from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# Random forest is the base estimator that Boruta wraps around
rfc = RandomForestClassifier(n_estimators=1000, n_jobs=-1, class_weight='balanced')
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2)

# BorutaPy expects numpy arrays, not DataFrames
x = df.iloc[:, :].values       # feature matrix
y = dflabel.iloc[:, 0].values  # label vector
boruta_selector.fit(x, y)

print("==============BORUTA==============")
print(boruta_selector.n_features_)  # number of confirmed important features
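Besides n_features_, BorutaPy also exposes support_ (a boolean mask of confirmed features), ranking_, and a transform() method in its scikit-learn-style API. A short follow-up sketch:

# Map the boolean mask back to column names
selected = [name for name, keep in zip(df.columns, boruta_selector.support_) if keep]
print('Confirmed features:', selected)
print('Feature rankings:', boruta_selector.ranking_)

# Keep only the confirmed columns
x_filtered = boruta_selector.transform(x)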

The next one we are going to use is LightGBM:

Light GBM:

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. LightGBM grows trees vertically while other algorithms grow trees horizontally, meaning that LightGBM grows trees leaf-wise while other algorithms grow level-wise. Here I am going to show how to find the feature importance.

For more, see this:

Algorithm:

import lightgbm as lgb
import matplotlib.pyplot as plt

# Convert the feature DataFrame and label DataFrame into LightGBM's Dataset format
d_train = lgb.Dataset(df, label=dflabel)

param = {"max_depth": 5, "learning_rate": 0.1, "num_leaves": 900, "n_estimators": 100}

# catFeature is a list of categorical column names
model2 = lgb.train(params=param, train_set=d_train, categorical_feature=catFeature)

print('Plot feature importances...')
ax = lgb.plot_importance(model2, max_num_features=10)
plt.show()

Let's understand the whole thing now:

d_train = lgb.Dataset(df, label=dflabel)

Here we are converting df (the feature DataFrame) and dflabel (the label DataFrame) into a Dataset, which is LightGBM's internally supported format.

param = {"max_depth": 5, "learning_rate": 0.1, "num_leaves": 900, "n_estimators": 100}

This just sets up the parameters for the LightGBM trees.
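For context, here is roughly what each of these parameters controls (the values are just the ones used above):

param = {
    "max_depth": 5,        # maximum depth of each tree, limits model complexity
    "learning_rate": 0.1,  # shrinkage applied to each boosting step
    "num_leaves": 900,     # maximum number of leaves per tree (leaf-wise growth)
    "n_estimators": 100,   # number of boosting rounds (an alias of num_iterations)
}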

model2 = lgb.train(params=param,train_set=d_train,categorical_feature=catFeature)

Here is something important:

lgb.train(params=..., train_set=..., categorical_feature=...)

catFeature is a list of categorical column names. We pass this list to the categorical_feature parameter to tell LightGBM which of our features are categorical.

NOTE: LightGBM has support for categorical features, but the input should be integers, not strings. For example, if you have 'Cats' and 'Dogs' as categorical values, you should label-encode them, e.g. Cats == 1 and Dogs == 2. And if the column name is, say, 'AnimalType', just cast it as 'category' like this: df['AnimalType'].astype('category')
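Putting that note together, a minimal sketch of the preprocessing (using the 'AnimalType' column from the note; LabelEncoder assigns 0-based codes):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encode the string categories as integers (e.g. 'Cats' -> 0, 'Dogs' -> 1)
df['AnimalType'] = LabelEncoder().fit_transform(df['AnimalType'])
df['AnimalType'] = df['AnimalType'].astype('category')

# This list is what gets passed to lgb.train(..., categorical_feature=catFeature)
catFeature = ['AnimalType']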

model2.feature_importance()

Shows you the feature importances as an array.

model2.feature_name()

Shows you the feature names.

importances = model2.feature_importance()
names = model2.feature_name()
for name, val in zip(names, importances):
    if val != 0:
        print('rank:', val, name)

This part just prints each feature name along with its importance value, skipping features with zero importance.

Chi Square:

Chi-square is a feature selection method. But unlike the earlier algorithms, Boruta and LightGBM, it is not a wrapper method. The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.

What does it basically do?

Say we have a table of observed counts, with persons as rows and products as columns, plus row totals, column totals, and a grand total.

(Chi-square table)

Let's look at the formulas. For each cell, the contribution to the chi-square statistic is (Observed − Expected)² / Expected, where Expected = (row total × column total) / grand total; the statistic is the sum of these contributions over all cells.

The expected count for Person1 → Product A is expected = 30 * (25/95).

And thus, for the Person1 → Product A cell, where the observed count is 10, the contribution is (10 − expected)² / expected.
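Working out that single cell numerically (numbers taken from the example above):

expected = 30 * (25 / 95)                       # ≈ 7.89
contribution = (10 - expected) ** 2 / expected  # ≈ 0.56
print(expected, contribution)

Summing such contributions over every cell of the table gives the chi-square statistic, which is then compared against the chi-square distribution to obtain a p-value.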

For a more in-depth look, see:

Implementation:

import pandas as pd
from scipy import stats as stat

def ChiSquare(df, featureList, label, alpha=0.05):
    for category in featureList:
        # Contingency table of the categorical feature vs. the label
        ct = pd.crosstab(df[category], df[label])
        chi_square_value, p_value, _, _ = stat.chi2_contingency(ct)
        if p_value <= alpha:
            print(category, 'is Important with p Value:', p_value)
        else:
            print(category, 'is Not Important with p Value:', p_value)
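A hypothetical call, assuming df contains the categorical feature columns plus a 'Churn' label column (the column names here are placeholders, not from the original dataset):

ChiSquare(df, featureList=['Country', 'Gender', 'ProductType'], label='Churn', alpha=0.05)

Any feature whose p-value falls at or below alpha is reported as important; the rest are candidates for removal.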
