Feature Importance on Data

Chanakan Limroscharoen
Super AI Engineer
5 min read · Mar 5, 2021

This article explains the basics of using feature importance methods before feeding data into a model, which can help improve model accuracy.

Data

The data in this example is the well-known “Titanic — Machine Learning from Disaster” dataset (a classification problem): https://www.kaggle.com/c/titanic

The methods are adapted from https://www.kaggle.com/imoore/titanic-the-only-notebook-you-need-to-see.

Clean Data

First of all, we explore and clean the data as usual in order to filter out irrelevant data.
The original dataset contains 12 features: ‘PassengerId’, ‘Survived’, ‘Pclass’, ‘Name’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Ticket’, ‘Fare’, ‘Cabin’, ‘Embarked’.

Since some features contain null values, I fill in the data as follows:
1. Fill null values in ‘Age’ with the mean
2. Categorize ‘Age’ into 5 groups
3. Fill null values in ‘Fare’ with the mean
4. Categorize ‘Fare’ into 5 groups
5. Fill null values in ‘Embarked’ (port of embarkation) with the most frequent port, ‘S’
6. Map the string values of ‘Sex’ and ‘Embarked’ to numbers

import pandas as pd

# import data
train = pd.read_csv('/kaggle/input/titanic/train.csv')
# fill missing 'Age' values with the mean
train = train.fillna({"Age": train['Age'].mean()})
# pd.cut uses fixed bin edges; there is no guarantee about how many rows fall in each bin
train['CategoricalAge'] = pd.cut(train['Age'], 5)
train.CategoricalAge.unique()  # inspect the 5 age bins
# replace 'Age' with its bin index (0-4)
train.loc[ train['Age'] <= 16, 'Age'] = 0
train.loc[(train['Age'] > 16) & (train['Age'] <= 32), 'Age'] = 1
train.loc[(train['Age'] > 32) & (train['Age'] <= 48), 'Age'] = 2
train.loc[(train['Age'] > 48) & (train['Age'] <= 64), 'Age'] = 3
train.loc[ train['Age'] > 64, 'Age'] = 4
# fill missing 'Fare' values with the mean
train = train.fillna({"Fare": train['Fare'].mean()})
# pd.qcut chooses bin edges so that each bin holds roughly the same number of rows
train['CategoricalFare'] = pd.qcut(train['Fare'], 5)
train.CategoricalFare.unique()  # inspect the 5 fare bins
# replace 'Fare' with its bin index (0-4)
train.loc[ train['Fare'] <= 7.854, 'Fare'] = 0
train.loc[(train['Fare'] > 7.854) & (train['Fare'] <= 10.5), 'Fare'] = 1
train.loc[(train['Fare'] > 10.5) & (train['Fare'] <= 21.679), 'Fare'] = 2
train.loc[(train['Fare'] > 21.679) & (train['Fare'] <= 39.688), 'Fare'] = 3
train.loc[ train['Fare'] > 39.688, 'Fare'] = 4
train['Age'] = train['Age'].astype(int)
train['Fare'] = train['Fare'].astype(int)
# fill missing 'Embarked' values with the most frequent port of embarkation, 'S'
train = train.fillna({"Embarked": "S"})
# map string values to numbers so the models can use these features
train['Sex'] = train['Sex'].map({'female': 0, 'male': 1}).astype(int)
train['Embarked'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

Feature Extraction

The next step is to select or combine variables from the original data to create new features, which reduces the dimensionality of the dataset. Feature extraction also reduces the resources needed for processing and speeds up computation. These are the new features that have been created:

  1. ‘Name_Count’: the number of words in the passenger’s name.
  2. ‘Has_Cabin’: whether the passenger has a cabin.
  3. ‘FamilySize’: a combination of ‘SibSp’ (number of siblings/spouses) and ‘Parch’ (number of parents/children).
  4. ‘Solotravel’: whether the passenger is traveling alone.

# number of words in the name
train['Name_Count'] = train['Name'].apply(lambda x: len(x.split()))
# check whether the passenger has a cabin (a missing 'Cabin' is NaN, which is a float)
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
# create new feature FamilySize as a combination of SibSp and Parch
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
# create new feature Solotravel from FamilySize
train['Solotravel'] = 0
train.loc[train['FamilySize'] == 1, 'Solotravel'] = 1

Drop the data that will not be used for model prediction.

# drop columns that are not used for model prediction
train = train.drop(['PassengerId','Name','Ticket','Cabin','CategoricalAge','CategoricalFare'], axis=1)

Remark: after cleaning and feature extraction there are 12 features: ‘Survived’, ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’, ‘Embarked’, ‘Name_Count’, ‘Has_Cabin’, ‘FamilySize’, ‘Solotravel’. (‘Survived’ is excluded from the model input because it is the target the model predicts.)

X = train.loc[:, train.columns != 'Survived'].values
y = train.loc[:, 'Survived'].values

Feature Importance

In this example, I use two different models to choose the features that are important for predicting the output: a Decision Tree and a Logistic Regression model.

Decision Tree Model

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

# fit a decision tree classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

# plot feature importance with a threshold line at 0.05
plt.bar([x for x in range(len(importance))], importance)
plt.axhline(y=0.05, color='r', linestyle='-')
plt.show()

# keep only the features whose importance is above the threshold
for i, v in enumerate(importance):
    if v >= 0.05:
        print('Feature: %0d, Score: %.5f' % (i, v))
Fig 1. Feature importance by Decision Tree Model
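The loops above print only positional feature indices. As a small sketch (assuming X keeps the column order of train with ‘Survived’ dropped, as built earlier), the indices can be mapped back to feature names:

# map printed feature indices back to column names
# (assumes X was built as train.loc[:, train.columns != 'Survived'])
feature_names = [c for c in train.columns if c != 'Survived']
for name, score in zip(feature_names, importance):
    print('%s: %.5f' % (name, score))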

Logistic Regression Model

from sklearn.linear_model import LogisticRegression

model_l = LogisticRegression()
model_l.fit(X_train, y_train)
# get importance (the coefficient of each feature)
importance_l = model_l.coef_[0]
# summarize feature importance
for i, v in enumerate(importance_l):
    print('Feature: %0d, Score: %.5f' % (i, v))

# plot feature importance with threshold lines at +0.3 and -0.3
plt.bar([x for x in range(len(importance_l))], importance_l)
plt.axhline(y=0.3, color='r', linestyle='-')
plt.axhline(y=-0.3, color='r', linestyle='-')
plt.show()

# keep only the features whose coefficient magnitude is above the threshold
for i, v in enumerate(importance_l):
    if v >= 0.3 or v <= -0.3:
        print('Feature: %0d, Score: %.5f' % (i, v))
Fig 2. Feature importance by Logistic Regression Model

Both the Decision Tree and Logistic Regression models agree on several important features. I choose the important features using the thresholds shown as red lines in figures 1 and 2 above.
I will use all of the important features from the Decision Tree Classifier for model prediction, which gives 8 of the 11 candidate features: ‘Pclass’, ‘Sex’, ‘Age’, ‘Fare’, ‘Embarked’, ‘Name_Count’, ‘Has_Cabin’, ‘FamilySize’.
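As an alternative to applying the threshold by hand, scikit-learn’s SelectFromModel can make the same cut on an already fitted estimator. This is only a sketch of that option, not part of the original workflow:

from sklearn.feature_selection import SelectFromModel
# keep the features whose importance in the fitted decision tree is >= 0.05
selector = SelectFromModel(model, threshold=0.05, prefit=True)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_train_sel.shape)  # (rows, number of selected features)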

Feature Selection

In this step, I use a correlation heatmap (Spearman rank correlation, as in the code below) to check the correlation between the important features. If two features are highly correlated, one of them can be filtered out to further reduce the dimensionality of the model.

import seaborn as sns

train_imp = train[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Name_Count', 'Has_Cabin', 'FamilySize']]  # the 8 important features
plt.figure(figsize=(16,16))
g = sns.heatmap(train_imp.corr(method='spearman'), annot=True, cmap="RdYlGn")

Fig 3. Feature correlation

# list feature pairs with low correlation
corre = train_imp.corr(method='spearman')
corre_pairs = corre.unstack()
sorted_pairss = corre_pairs.sort_values(kind="quicksort")
pairss = sorted_pairss[sorted_pairss < 0.8]
print(pairss)
# list feature pairs with high correlation, which would be candidates to filter out
strong_pairss = sorted_pairss[sorted_pairss > 0.8]
print(strong_pairss)

As shown in figure 3, there is no high correlation between any pair of important features, which suggests that each feature carries its own information. All 8 features will be used for predicting the output.
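No pair crosses the 0.8 threshold here, but if one did, a common pattern (shown only as a sketch) is to drop one feature from each highly correlated pair by looking at the upper triangle of the correlation matrix:

import numpy as np
# keep only the upper triangle (excluding the diagonal) so each pair is checked once
upper = corre.where(np.triu(np.ones(corre.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col].abs() > 0.8).any()]
print(to_drop)  # empty for this dataset
train_imp_reduced = train_imp.drop(columns=to_drop)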

These are all the steps needed to apply feature importance methods to a dataset. After that, we can use the selected features in our model.

Model Example

I will use an XGBoost model in this example. The trained model can then be applied to the selected important features of the test dataset.

from xgboost import XGBClassifier
import xgboost as xgb

# use only the 8 important features
drop_elements = ['SibSp', 'Parch', 'Solotravel']
train_final = train.drop(drop_elements, axis=1)
XX = train_final.loc[:, train_final.columns != 'Survived'].values
yy = train_final.loc[:, 'Survived'].values
XX_train, XX_test, yy_train, yy_test = train_test_split(XX, yy, test_size=0.2, random_state=123)

xgb_model = xgb.XGBClassifier(
    # learning_rate=0.02,
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    # gamma=1,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=-1,
    scale_pos_weight=1)
xgb_model.fit(XX_train, yy_train)
xgb_model.score(XX_test, yy_test)
results = xgb_model.predict(XX_test)
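For a more stable estimate than a single train/test split, a quick cross-validation check can be run on the same 8 features. This is just a sketch; the exact scores depend on the split and on the XGBoost parameters above:

from sklearn.model_selection import cross_val_score
# 5-fold cross-validation accuracy on the selected features
cv_scores = cross_val_score(xgb_model, XX, yy, cv=5, scoring='accuracy')
print(cv_scores.mean(), cv_scores.std())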

I hope this article is useful for anyone who wants to apply this technique to machine learning or data science problems.
