Bank Data: Classification Part 3

Zaki Jefferson · Published in Analytics Vidhya · Sep 30, 2020 · 3 min read

This blog is part 3 of 4, and in it we will be discussing Boosting.

Gradient Boosting

Gradient Boosting is a machine learning technique for classification and regression that combines many weak learners (typically shallow decision trees) into a single strong learner, with each new learner trained to correct the errors of the ensemble built so far.
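At its core, boosting is stagewise: each new tree is fit to the errors (residuals) of the current ensemble, and its prediction is added with a small learning rate. Here is a toy sketch of this idea for squared-error regression; it is purely illustrative and not scikit-learn's actual internals:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_trees=100, lr=0.1):
    """Stagewise boosting for squared error: each tree fits the residuals."""
    pred = np.full(len(y), y.mean())  # start from the constant mean prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - pred  # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += lr * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)
    return trees

Back to the bank data: below we fit scikit-learn's GradientBoostingClassifier and tune it with a grid search.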

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

gb_clf = GradientBoostingClassifier()

# Grid Search
param_gb = {"n_estimators": [100, 300, 500], "max_depth": [3, 5]}
grid_gb = GridSearchCV(gb_clf, param_grid=param_gb)
grid_gb.fit(X_train_new, y_train_new)

grid_gb.cv_results_

We continue to use grid search and take a look at the cross-validation performance. Based on the results above, we should obtain a test score of around 76%.
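cv_results_ is a dictionary of arrays, which can be dense to read; a quicker way to pull out the winning configuration (assuming the fitted grid_gb from above):

# Best hyperparameters and mean cross-validation score of the refit model
print(grid_gb.best_params_)
print(grid_gb.best_score_)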

Confusion Matrix

import pandas as pd

# Confusion Matrix
print('Confusion Matrix - Testing Dataset')
print(pd.crosstab(y_test, grid_gb.predict(X_test),
                  rownames=['True'], colnames=['Predicted'], margins=True))

confusion_matrix_metrics(TN=719, FP=3790, FN=829, TP=10166, P=10995, N=4509)

The confusion matrix above works out to an F1 score of about 81%, with recall (roughly 92%) as the main contributor; precision is closer to 73%.
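confusion_matrix_metrics is the helper used throughout this series, presumably defined in an earlier part. A minimal sketch of what such a helper computes (a reconstruction, not necessarily the original code):

def confusion_matrix_metrics(TN, FP, FN, TP, P, N):
    """Print classification metrics from confusion-matrix counts."""
    accuracy = (TP + TN) / (P + N)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"Accuracy:  {accuracy:.2%}")
    print(f"Precision: {precision:.2%}")
    print(f"Recall:    {recall:.2%}")
    print(f"F1 Score:  {f1:.2%}")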

Feature Importance

import matplotlib.pyplot as plt

# Graphing the feature importances of the refit best model from the grid search
fig, ax = plt.subplots(figsize=(15, 10))
ax.barh(y=X_train_new.columns, width=grid_gb.best_estimator_.feature_importances_)
plt.show()

Feature importance shows us which features our Gradient Boosting model considers paramount. The main distinction from Random Forest is that Gradient Boosting concentrates its importance scores on a handful of features, so there is no thin line: the gap between important and unimportant features is clear as day.
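One way to see that concentration is to print the importances in sorted order (again using the refit best estimator from the grid search above):

# Sort features by importance, highest first
importance_pairs = zip(X_train_new.columns, grid_gb.best_estimator_.feature_importances_)
for name, score in sorted(importance_pairs, key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")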

Gradient Boosting on Important Features

We will select an importance threshold, use it to perform feature selection, and then run Gradient Boosting on the reduced feature set.

import numpy as np

# Selecting the features whose importance exceeds the 0.08 threshold
importances = grid_gb.best_estimator_.feature_importances_
gb_important_features = np.where(importances > 0.08)
print(gb_important_features)
print(len(gb_important_features[0]))  # Number of features that qualify

# Extracting the top feature column names
gb_important_feature_names = list(X_train_new.columns[gb_important_features[0]])
gb_important_feature_names

The code above lists the features whose importance exceeds our threshold of 0.08; these are the features that will be selected and tested.

# Creating new training and testing data with top features
gb_important_train_features = X_train_new[gb_important_feature_names]
gb_important_test_features = X_test[gb_important_feature_names]

The code above restricts the training and testing dataframes to the selected features only.
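As an aside, scikit-learn has a built-in helper for this same pattern, SelectFromModel; a brief sketch of the equivalent selection (note that transform returns a NumPy array rather than a dataframe):

from sklearn.feature_selection import SelectFromModel

# Same idea via sklearn's helper; prefit=True reuses the already-fitted model
selector = SelectFromModel(grid_gb.best_estimator_, threshold=0.08, prefit=True)
X_train_selected = selector.transform(X_train_new)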

We will now run a grid search over the reduced training data and look at the results:

# Grid search
param_gb = {"n_estimators": [100, 500, 700], "max_depth": [3, 7, 9]}
grid_gb = GridSearchCV(gb_clf, param_grid=param_gb)
grid_gb.fit(gb_important_train_features, y_train_new)

grid_gb.cv_results_

The cross-validation results show that we can expect a score of around 72%.
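To see where that ~72% comes from, the raw cv_results_ can be tabulated as a dataframe, one row per hyperparameter combination:

# Mean cross-validation score per hyperparameter combination
results = pd.DataFrame(grid_gb.cv_results_)
print(results[['param_n_estimators', 'param_max_depth', 'mean_test_score']])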

Confusion Matrix for Important Features

# Confusion Matrix
print('Confusion Matrix - Testing Dataset')
print(pd.crosstab(y_test, grid_gb.predict(gb_important_test_features),
                  rownames=['True'], colnames=['Predicted'], margins=True))
confusion_matrix_metrics(TN=593, FP=3916, FN=825, TP=10170, P=10995, N=4509)

Looking back at the first confusion matrix, feature selection made little to no difference in our F1 score: roughly 81% in both cases, but now with a much smaller feature set.
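For a direct check, the same numbers can be recomputed with sklearn.metrics instead of reading them off the confusion matrix (assuming a binary 0/1 target):

from sklearn.metrics import f1_score, classification_report

preds = grid_gb.predict(gb_important_test_features)
print(f1_score(y_test, preds))  # roughly 0.81, matching the full-feature model
print(classification_report(y_test, preds))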
