Data Analysis Practical Project based on Lending Club [2]

FreaxRuby
7 min read · Feb 21, 2024


This practical project is based on the Lending Club dataset (dataset address: https://github.com/H-Freax/lendingclub_analyse).

This project is carried out in a Colab environment

Introduction

This data analysis project is divided into two parts. The first part introduces a LightGBM-based baseline and three approaches to adding derived variables, identifying four sets of derived variables that improve performance. The second part covers data analysis with machine learning and deep learning methods, ensembles of machine learning models, and fusions of deep learning networks with machine learning methods.

Solving with Machine Learning Methods

Data Preparation

train_ML = df_train.copy()
test_ML = df_test.copy()

train_ML.fillna(0, inplace=True)
test_ML.fillna(0, inplace=True)

X_train = train_ML.drop(columns=['loan_status']).values
Y_train = train_ML['loan_status'].values.astype(int)
X_test = test_ML.drop(columns=['loan_status']).values
Y_test = test_ML['loan_status'].values.astype(int)
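The code above assumes df_train and df_test are already in memory from the first part of the project. If you are starting fresh, a minimal loading sketch might look like the following; the file names are hypothetical and should be replaced with the actual files from the dataset repository.

import pandas as pd

# Hypothetical file names -- replace with the actual CSVs from the dataset repository
df_train = pd.read_csv('lendingclub_train.csv')
df_test = pd.read_csv('lendingclub_test.csv')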

Machine Learning Methods

Random Forest

from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators = 100, random_state = 20)
rnd_clf.fit(X_train, Y_train)
rnd_clf.score(X_test, Y_test)

0.9164
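A fitted random forest also exposes per-feature importances, which helps show which columns drive the predictions. A quick sketch using the column names from train_ML:

import pandas as pd

feat_imp = pd.Series(rnd_clf.feature_importances_,
                     index=train_ML.drop(columns=['loan_status']).columns)
print(feat_imp.sort_values(ascending=False).head(10))  # ten most important features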

SGDClassifier (Stochastic Gradient Descent)

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=20) #random_state for reproducibility
sgd_clf.fit(X_train, Y_train)
sgd_clf.score(X_test, Y_test)

0.8639
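SGDClassifier is sensitive to feature scaling, which is not applied here. A hedged sketch that puts a StandardScaler in front of it (an addition, not part of the original workflow) may lift this score:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# Standardize features before the linear model
sgd_scaled = make_pipeline(StandardScaler(), SGDClassifier(random_state=20))
sgd_scaled.fit(X_train, Y_train)
print(sgd_scaled.score(X_test, Y_test))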

Logistic Regression

from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(random_state = 20)
lr_clf.fit(X_train, Y_train)
lr_clf.score(X_test, Y_test)

0.9111

GBDT

from sklearn.ensemble import GradientBoostingClassifier
gdbt_clf = GradientBoostingClassifier(random_state = 20)
gdbt_clf.fit(X_train, Y_train)
gdbt_clf.score(X_test, Y_test)

0.91772

from sklearn.model_selection import cross_val_predict
y_train_pred=cross_val_predict(gdbt_clf, X_train, Y_train, cv=3)
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

conf_mx=confusion_matrix(Y_train, y_train_pred)
conf_mx

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
conf_mx

array([[ 8271,  1941],
       [ 2098, 37690]])
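From the same cross-validated predictions, precision and recall give a more informative picture than accuracy alone, especially with the class imbalance visible in the matrix:

from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(Y_train, y_train_pred))  # of predicted positives, how many are correct
print(recall_score(Y_train, y_train_pred))     # of actual positives, how many are found
print(f1_score(Y_train, y_train_pred))         # harmonic mean of precision and recall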

SVM (Support Vector Machine) Classifier

from sklearn.svm import SVC
svm_clf = SVC()
svm_clf.fit(X_train, Y_train)
svm_clf.score(X_test, Y_test)

0.80448

AdaBoost Classifier

from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier()
ada_clf.fit(X_train, Y_train)
ada_clf.score(X_test, Y_test)

0.91604

LightGBM

from lightgbm import LGBMClassifier
lgbm_clf = LGBMClassifier()
lgbm_clf.fit(X_train, Y_train)
lgbm_clf.score(X_test, Y_test)

0.91768

XGB Classifier

from xgboost import XGBClassifier #XGB Classifier
xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, Y_train)
xgb_clf.score(X_test, Y_test)

0.91712

Naive Bayes Classifier

from sklearn.naive_bayes import GaussianNB
nby_clf = GaussianNB()
nby_clf.fit(X_train, Y_train)
nby_clf.score(X_test, Y_test)

0.90478

K-Nearest Neighbors Classifier

from sklearn.neighbors import KNeighborsClassifier
knc_clf = KNeighborsClassifier()
knc_clf.fit(X_train, Y_train)
knc_clf.score(X_test, Y_test)

0.84852

Ensemble

Voting Fusion Method

from sklearn.ensemble import VotingClassifier  # Voting Classifier

voting_clf = VotingClassifier(
    estimators=[('rf', rnd_clf), ('gdbt', gdbt_clf), ('ada', ada_clf),
                ('lgbm', lgbm_clf), ('xgb', xgb_clf)],  # estimators: the sub-classifiers
    voting='hard')  # voting: 'hard' (majority vote) or 'soft' (averaged probabilities)


# Train each model and print its accuracy
from sklearn.metrics import accuracy_score
for clf in (lr_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, Y_train)
    y_pre = clf.predict(X_test)
    print(clf.__class__, accuracy_score(y_pre, Y_test))

Output results

<class 'sklearn.linear_model._logistic.LogisticRegression'> 0.91108
<class 'sklearn.ensemble._forest.RandomForestClassifier'> 0.9164
<class 'sklearn.svm._classes.SVC'> 0.80448
<class 'sklearn.ensemble._voting.VotingClassifier'> 0.91814

If all classifiers can estimate class probabilities (i.e., they all have a predict_proba() method in scikit-learn), the class probabilities can be averaged and the voting classifier predicts the class with the highest average probability. This is called soft voting. Only two changes are needed in the code: set probability=True on the support vector machine so it estimates class probabilities, and set voting='soft' in the voting classifier.

# Soft voting
svm_clf1 = SVC(probability=True)
voting_clf = VotingClassifier(estimators=[('lf', lr_clf), ('svc', svm_clf1), ('rf', rnd_clf)],
                              voting='soft')
for clf in (lr_clf, rnd_clf, svm_clf1, voting_clf):
    clf.fit(X_train, Y_train)
    y_pre = clf.predict(X_test)
    print(clf.__class__, accuracy_score(y_pre, Y_test))

Output results

<class 'sklearn.linear_model._logistic.LogisticRegression'> 0.91108
<class 'sklearn.ensemble._forest.RandomForestClassifier'> 0.9164
<class 'sklearn.svm._classes.SVC'> 0.80448
<class 'sklearn.ensemble._voting.VotingClassifier'> 0.91664

Normally, soft voting tends to improve results, but in this fusion the score was slightly lower.

Stacking

Stacking is an ensemble learning technique that uses the predictions of multiple base models (e.g., decision trees, KNN, or SVM) to build a new model, which then makes predictions on the test set. Here is a step-by-step explanation of a simple stacking ensemble:

  1. Divide the training set into 10 groups.
  2. A base model (e.g., decision tree) is trained on 9 of the groups and predicts on the 10th group.
  3. Then, the base model (e.g., decision tree) is fitted to the entire training dataset.
  4. Use this model to make predictions on the test set.
  5. Repeat steps 2 to 4 for another base model (e.g., knn), producing another set of predictions for the training and test sets.
  6. The predictions for the training set are used as features to build a new model.
  7. This model is used to make the final predictions on the test set, using the stacked predictions as features.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def Stacking(model, train, y, test, n_fold):
    # shuffle=True so that random_state takes effect (recent scikit-learn requires it)
    folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=1)
    test_pred = np.empty((test.shape[0], 0), float)  # one column of test predictions per fold
    train_pred = np.empty((0, 1), float)             # out-of-fold predictions on the training set
    for train_indices, val_indices in folds.split(train, y.values):
        x_train, x_val = train.iloc[train_indices], train.iloc[val_indices]
        y_train, y_val = y.iloc[train_indices], y.iloc[val_indices]

        model.fit(X=x_train, y=y_train)
        train_pred = np.append(train_pred, model.predict(x_val))
        test_pred = np.column_stack((test_pred, model.predict(test)))
    test_pred_mean = np.mean(test_pred, axis=1)  # average the per-fold test predictions row-wise
    return test_pred_mean.reshape(-1, 1), train_pred

Use gdbt and lgbm for the first layer

x_train=train_ML.drop(columns=['loan_status'])
x_test=test_ML.drop(columns=['loan_status'])
y_train=train_ML['loan_status']

test_pred1, train_pred1=Stacking(model=gdbt_clf, n_fold=10, train=x_train, test=x_test, y=y_train)
print(test_pred1.size)
train_pred1=pd.DataFrame(train_pred1)
test_pred1=pd.DataFrame(test_pred1)

test_pred2, train_pred2=Stacking(model=lgbm_clf, n_fold=10, train=x_train, test=x_test, y=y_train)
print(test_pred2.size)
train_pred2=pd.DataFrame(train_pred2)
test_pred2=pd.DataFrame(test_pred2)

Use Random Forest for the second layer

dff = pd.concat([train_pred1, train_pred2], axis=1)
dff_test = pd.concat([test_pred1, test_pred2], axis=1)

rnd_clf.fit(dff, y_train)
rnd_clf.score(dff_test, Y_test)

0.91798
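scikit-learn also ships a built-in StackingClassifier that handles the out-of-fold predictions internally. A minimal sketch reproducing the same layer structure (GBDT and LightGBM in the first layer, Random Forest in the second) might look like this:

from sklearn.ensemble import StackingClassifier

stack_clf = StackingClassifier(
    estimators=[('gdbt', gdbt_clf), ('lgbm', lgbm_clf)],  # first layer
    final_estimator=rnd_clf,                              # second layer
    cv=10)                                                # out-of-fold predictions, as in Stacking() above
stack_clf.fit(X_train, Y_train)
print(stack_clf.score(X_test, Y_test))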

Blending

Blending follows the same idea as stacking but uses only a holdout (validation) split of the training set for the meta-features. In other words, unlike stacking, predictions are made only on the holdout set; the holdout set and its predictions are then used to build a second-layer model, which is evaluated on the test set. Here is a detailed explanation of the blending process (a minimal sketch with an explicit holdout split follows the list):

  1. The original training set is split into a training set and a validation set.
  2. Fit the model to the training set.
  3. Make predictions on the validation and test sets.
  4. The validation set and its predictions are used as features to build a new model.
  5. This model is used for the final predictions on the test set and meta-features.
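For reference, here is a minimal sketch of the holdout-based variant described above, assuming train_test_split from scikit-learn; the variable names are illustrative and this is not the exact code used below.

import pandas as pd
from sklearn.model_selection import train_test_split

# Carve a holdout (validation) split out of the training data
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=20)

gdbt_clf.fit(x_tr, y_tr)
lgbm_clf.fit(x_tr, y_tr)

# Base-model predictions on the holdout and test sets become extra features
val_feats = pd.DataFrame({'gdbt': gdbt_clf.predict(x_val), 'lgbm': lgbm_clf.predict(x_val)})
test_feats = pd.DataFrame({'gdbt': gdbt_clf.predict(x_test), 'lgbm': lgbm_clf.predict(x_test)})

df_val = pd.concat([x_val.reset_index(drop=True), val_feats], axis=1)
df_test_blend = pd.concat([x_test.reset_index(drop=True), test_feats], axis=1)

rnd_clf.fit(df_val, y_val)
print(rnd_clf.score(df_test_blend, Y_test))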

Following the same order as before: first GBDT and LightGBM as the first layer, then Random Forest as the second layer.

x_train=train_ML.drop(columns=['loan_status'])
x_test=test_ML.drop(columns=['loan_status'])
y_train=train_ML['loan_status']

val_pred1 = gdbt_clf.predict(x_train)
test_pred1 = gdbt_clf.predict(x_test)
val_pred1 = pd.DataFrame(val_pred1)
test_pred1 = pd.DataFrame(test_pred1)


val_pred2 = lgbm_clf.predict(x_train)
test_pred2 = lgbm_clf.predict(x_test)
val_pred2 = pd.DataFrame(val_pred2)
test_pred2 = pd.DataFrame(test_pred2)

df2_val = pd.concat([x_train, val_pred1, val_pred2], axis=1)
df2_test = pd.concat([x_test, test_pred1, test_pred2], axis=1)

rnd_clf.fit(df2_val, y_train)
rnd_clf.score(df2_test, Y_test)

0.91668

Deep Learning Network

DNN

Data Preparation

train_DL = df_train.copy()
test_DL = df_test.copy()
train_DL.fillna(0, inplace=True)
test_DL.fillna(0, inplace=True)

X_train = train_DL.drop(columns=['loan_status']).values
Y_train = train_DL['loan_status'].values.astype(int)
X_test = test_DL.drop(columns=['loan_status']).values
Y_test = test_DL['loan_status'].values.astype(int)

from tensorflow.keras.utils import to_categorical
Y_test=to_categorical(Y_test, 2).astype(int)
Y_train=to_categorical(Y_train, 2).astype(int)

Building the Network


import keras as K
from keras.layers.core import Dropout
init = K.initializers.glorot_uniform(seed=1)
model = K.models.Sequential()
model.add(K.layers.Dense(units=146, input_dim=145, kernel_initializer=init, activation='relu'))
model.add(K.layers.Dense(units=147, kernel_initializer=init, activation='relu'))
model.add(K.layers.Dense(units=2, kernel_initializer=init, activation='softmax'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'])

b_size = 128
max_epochs = 100
print("Starting training ")

h = model.fit(X_train, Y_train, batch_size=b_size, epochs=max_epochs, shuffle=True, verbose=1)
print("Training finished \n")

Test Results

eval = model.evaluate(X_test, Y_test, verbose=0)
print("Evaluation on test data: loss = %0.6f accuracy = %0.2f%% \n" \
% (eval[0], eval[1] * 100) )

Evaluation on test data: loss = 0.244760 accuracy = 90.52%
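Training for a fixed 100 epochs with no validation signal can over- or under-fit. An optional tweak (not part of the original run) is to hold out part of the training data and stop early once the validation loss stops improving:

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
h = model.fit(X_train, Y_train, batch_size=b_size, epochs=max_epochs,
              validation_split=0.1,         # hold out 10% of the training data
              callbacks=[early_stop], shuffle=True, verbose=1)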

Deep Learning Network DNN+Trick (Adam)

Data Preparation

train_DL = df_train.copy()
test_DL = df_test.copy()
train_DL.fillna(0, inplace=True)
test_DL.fillna(0, inplace=True)

X_train = train_DL.drop(columns=['loan_status']).values
Y_train = train_DL['loan_status'].values.astype(int)
X_test = test_DL.drop(columns=['loan_status']).values
Y_test = test_DL['loan_status'].values.astype(int)

from tensorflow.keras.utils import to_categorical
Y_test=to_categorical(Y_test, 2).astype(int)
Y_train=to_categorical(Y_train, 2).astype(int)

Building the Network


import keras as K
from keras.layers.core import Dropout
init = K.initializers.glorot_uniform(seed=1)
simple_adam = K.optimizers.Adam() #trick added adam
model = K.models.Sequential()
model.add(K.layers.Dense(units=146, input_dim=145, kernel_initializer=init, activation='relu'))
# model.add(Dropout(0.1)) #using dropout did not improve the results
model.add(K.layers.Dense(units=147, kernel_initializer=init, activation='relu'))
# model.add(Dropout(0.9))
model.add(K.layers.Dense(units=2, kernel_initializer=init, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=simple_adam, metrics=['accuracy'])

b_size = 128
max_epochs = 100
print("Starting training ")

h = model.fit(X_train, Y_train, batch_size=b_size, epochs=max_epochs, shuffle=True, verbose=1)
print("Training finished \n")

Test Results

eval = model.evaluate(X_test, Y_test, verbose=0)
print("Evaluation on test data: loss = %0.6f accuracy = %0.2f%% \n" \
% (eval[0], eval[1] * 100) )

Evaluation on test data: loss = 0.214410 accuracy = 91.21%

Deep Learning Network DNN+Trick (RMSprop)

Data Preparation

train_DL = df_train.copy()
test_DL = df_test.copy()
train_DL.fillna(0, inplace=True)
test_DL.fillna(0, inplace=True)

X_train = train_DL.drop(columns=['loan_status']).values
Y_train = train_DL['loan_status'].values.astype(int)
X_test = test_DL.drop(columns=['loan_status']).values
Y_test = test_DL['loan_status'].values.astype(int)

from tensorflow.keras.utils import to_categorical
Y_test=to_categorical(Y_test, 2).astype(int)
Y_train=to_categorical(Y_train, 2).astype(int)

Building the Network


import keras as K
from keras.layers.core import Dropout
init = K.initializers.glorot_uniform(seed=1)
simple_rmsprop = K.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-06) # trick: RMSprop optimizer
model = K.models.Sequential()
model.add(K.layers.Dense(units=146, input_dim=145, kernel_initializer=init, activation='relu'))
# model.add(Dropout(0.1)) #using dropout did not improve the results
model.add(K.layers.Dense(units=147, kernel_initializer=init, activation='relu'))
# model.add(Dropout(0.9))
model.add(K.layers.Dense(units=2, kernel_initializer=init, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=simple_rmsprop, metrics=['accuracy'])

b_size = 128
max_epochs = 100
print("Starting training ")

h = model.fit(X_train, Y_train, batch_size=b_size, epochs=max_epochs, shuffle=True, verbose=1)
print("Training finished \n")

Test Results

eval = model.evaluate(X_test, Y_test, verbose=0)
print("Evaluation on test data: loss = %0.6f accuracy = %0.2f%% \n" \
% (eval[0], eval[1] * 100) )

Evaluation on test data: loss = 0.237782 accuracy = 91.39%

TabNet

Environment Import

!pip install pytorch-tabnet

Data Preparation

train_DL = df_train.copy()
test_DL = df_test.copy()
train_DL.fillna(0, inplace=True)
test_DL.fillna(0, inplace=True)

X_train = train_DL.drop(columns=['loan_status']).values
Y_train = train_DL['loan_status'].values.astype(int)
X_test = test_DL.drop(columns=['loan_status']).values
Y_test = test_DL['loan_status'].values.astype(int)

Building the Network

from pytorch_tabnet.tab_model import TabNetClassifier, TabNetRegressor

clf = TabNetClassifier()  # TabNetRegressor() for regression tasks
clf.fit(X_train, Y_train)
preds = clf.predict(X_test)

Test Results

accuracy_score(Y_test, preds)

0.9115
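pytorch-tabnet's fit() also accepts an evaluation set and early stopping. A sketch that carves a validation split out of the training data (an addition, not in the original run):

from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=20)
clf = TabNetClassifier()
clf.fit(X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        eval_metric=['accuracy'],
        max_epochs=100, patience=20)  # stop when validation accuracy stops improving
preds = clf.predict(X_test)
print(accuracy_score(Y_test, preds))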

Deep Learning Network Integrated with Machine Learning

Stacking Ensemble DNN

train_DL = df_train.copy()
test_DL = df_test.copy()
train_DL.fillna(0, inplace=True)
test_DL.fillna(0, inplace=True)

X_train = train_DL.drop(columns=['loan_status']).values
Y_train = train_DL['loan_status'].values.astype(int)
X_test = test_DL.drop(columns=['loan_status']).values
Y_test = test_DL['loan_status'].values.astype(int)

from tensorflow.keras.utils import to_categorical
Y_test=to_categorical(Y_test, 2).astype(int)
Y_train=to_categorical(Y_train, 2).astype(int)

import keras as K
from keras.layers.core import Dropout
init = K.initializers.glorot_uniform(seed=1)
model = K.models.Sequential()
model.add(K.layers.Dense(units=146, input_dim=2, kernel_initializer=init, activation='relu'))
model.add(K.layers.Dense(units=147, kernel_initializer=init, activation='relu'))
model.add(K.layers.Dense(units=2, kernel_initializer=init, activation='softmax'))
model.compile(loss='categorical_crossentropy', metrics=['accuracy'])
x_train=train_ML.drop(columns=['loan_status'])
x_test=test_ML.drop(columns=['loan_status'])
y_train=train_ML['loan_status']

test_pred1, train_pred1=Stacking(model=gdbt_clf, n_fold=10, train=x_train, test=x_test, y=y_train)
print(test_pred1.size)
train_pred1=pd.DataFrame(train_pred1)
test_pred1=pd.DataFrame(test_pred1)

test_pred2, train_pred2=Stacking(model=lgbm_clf, n_fold=10, train=x_train, test=x_test, y=y_train)
print(test_pred2.size)
train_pred2=pd.DataFrame(train_pred2)
test_pred2=pd.DataFrame(test_pred2)
dff = pd.concat([train_pred1, train_pred2], axis=1)
dff_test = pd.concat([test_pred1, test_pred2], axis=1)

model.fit(dff, Y_train)  # Y_train: the one-hot encoded labels prepared above
eval = model.evaluate(dff_test, Y_test, verbose=0)
print("Evaluation on test data: loss = %0.6f accuracy = %0.2f%% \n" \
% (eval[0], eval[1] * 100) )

Results

1563/1563 [==============================] - 4s 2ms/step - loss: 0.2892 - accuracy: 0.9029
Evaluation on test data: loss = 0.261336 accuracy = 91.83%
